Course 4: Machine Learning for Financial Markets
Course Overview
| Duration | Modules | Exercises | Level |
|---|---|---|---|
| ~45 hours | 14 + Capstone | ~95 | Intermediate to Advanced |
What You'll Learn
- Apply machine learning to predict market movements
- Build robust feature engineering pipelines
- Train and evaluate classification and regression models
- Analyze sentiment from news and social media
- Deploy production ML systems with monitoring
- Avoid common pitfalls unique to financial ML
Course Structure
Part 1: ML Fundamentals for Finance
Build the foundation for financial machine learning.
| Module | Title | Key Topics |
|---|---|---|
| 1 | ML Concepts for Traders | Supervised/unsupervised, why finance is different |
| 2 | Data Preparation | Time series splits, cross-validation, imbalanced data |
| 3 | Feature Engineering | Price features, technical indicators, statistical features |
| 4 | Target Engineering | Triple barrier method, meta-labeling, lookahead bias |
Part 2: Classification Models
Master the models used for direction prediction.
| Module | Title | Key Topics |
|---|---|---|
| 5 | Tree-Based Models | Decision trees, Random Forest, XGBoost, LightGBM |
| 6 | Other Classification Models | Logistic regression, SVM, neural networks |
| 7 | Model Evaluation | Financial metrics, confusion matrix, ROC curves |
Part 3: Advanced Techniques
Expand beyond classification into specialized domains.
| Module | Title | Key Topics |
|---|---|---|
| 8 | Regression Models | Return prediction, volatility forecasting, quantile regression |
| 9 | Sentiment Analysis | Text processing, sentiment scoring, news signals |
| 10 | Alternative Data | Web scraping, social media, multi-source features |
Part 4: Deep Learning & Production
Deploy models to production with proper infrastructure.
| Module | Title | Key Topics |
|---|---|---|
| 11 | Deep Learning for Finance | Neural networks, LSTM, transformers |
| 12 | Backtesting ML Strategies | Walk-forward optimization, avoiding pitfalls |
| 13 | Production ML Systems | Model deployment, feature pipelines, monitoring |
| 14 | Advanced ML Topics | Reinforcement learning, ensembles, online learning |
Why Financial ML is Different
Machine learning in finance faces challenges that are rare or absent in other domains:
# Financial ML Challenges
challenges = {
'Low Signal-to-Noise': {
'description': 'Financial data is extremely noisy',
'implication': 'Models easily overfit to noise instead of signal',
'solution': 'Robust validation, regularization, feature selection'
},
'Non-Stationarity': {
'description': 'Market dynamics change over time',
'implication': 'Models trained on past data may not work on future data',
'solution': 'Walk-forward validation, adaptive models, regime detection'
},
'Regime Changes': {
'description': 'Markets shift between bull/bear/sideways regimes',
'implication': 'A model that works in one regime may fail in another',
'solution': 'Regime-aware models, ensemble approaches'
},
'Adversarial Environment': {
'description': 'Other traders adapt to profitable strategies',
'implication': 'Alpha decays as strategies become crowded',
'solution': 'Continuous innovation, unique data sources'
},
'Lookahead Bias': {
'description': 'Easy to accidentally use future information',
'implication': 'Backtests look great but live trading fails',
'solution': 'Strict point-in-time data, purging, embargo'
}
}
for challenge, details in challenges.items():
print(f"\n{challenge}")
print(f" Problem: {details['description']}")
print(f" Risk: {details['implication']}")
print(f" Solution: {details['solution']}")
The ML Pipeline for Trading
# The Financial ML Pipeline
pipeline_stages = """
┌─────────────────────────────────────────────────────────────────────────────┐
│ FINANCIAL ML PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
1. DATA COLLECTION
├── Price data (OHLCV)
├── Fundamental data
├── Alternative data (news, social, satellite)
└── Point-in-time considerations
│
▼
2. DATA PREPARATION
├── Handle missing data
├── Adjust for corporate actions
├── Time series train/test split
└── Avoid lookahead bias
│
▼
3. FEATURE ENGINEERING
├── Price-based features (returns, volatility)
├── Technical indicators
├── Statistical features (z-scores, percentiles)
└── Feature selection
│
▼
4. TARGET ENGINEERING
├── Define prediction target
├── Triple barrier method
├── Meta-labeling
└── Sample weighting
│
▼
5. MODEL TRAINING
├── Select algorithm(s)
├── Hyperparameter tuning
├── Cross-validation (time series aware)
└── Ensemble methods
│
▼
6. EVALUATION
├── ML metrics (accuracy, F1, AUC)
├── Financial metrics (Sharpe, returns)
├── Walk-forward testing
└── Statistical significance
│
▼
7. DEPLOYMENT
├── Feature pipeline
├── Real-time prediction
├── Model monitoring
└── Retraining triggers
"""
print(pipeline_stages)
Key Libraries
# Core libraries used throughout this course
# Data manipulation
import pandas as pd
import numpy as np
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Machine Learning
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Financial data
import yfinance as yf
print("Core libraries imported successfully!")
print(f"\nVersions:")
print(f" pandas: {pd.__version__}")
print(f" numpy: {np.__version__}")
Quick Preview: A Simple ML Trading Model
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score, classification_report
import yfinance as yf
import warnings
warnings.filterwarnings('ignore')
# 1. Get data
print("1. Fetching data...")
ticker = yf.Ticker("SPY")
df = ticker.history(period="2y")
print(f" Downloaded {len(df)} days of SPY data")
# 2. Create features
print("\n2. Engineering features...")
df['returns'] = df['Close'].pct_change()
df['sma_5'] = df['Close'].rolling(5).mean()
df['sma_20'] = df['Close'].rolling(20).mean()
df['volatility'] = df['returns'].rolling(20).std()
df['momentum'] = df['Close'].pct_change(10)
# Feature: distance from moving average
df['dist_sma_5'] = (df['Close'] - df['sma_5']) / df['sma_5']
df['dist_sma_20'] = (df['Close'] - df['sma_20']) / df['sma_20']
# 3. Create target (next day direction)
print("3. Creating target labels...")
df['target'] = (df['returns'].shift(-1) > 0).astype(int)
# 4. Prepare data
features = ['dist_sma_5', 'dist_sma_20', 'volatility', 'momentum']
df_clean = df.dropna()
X = df_clean[features]
y = df_clean['target']
# 5. Time series split (respects temporal order)
print("\n4. Splitting data (time series aware)...")
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
print(f" Training: {len(X_train)} samples")
print(f" Testing: {len(X_test)} samples")
# 6. Train model
print("\n5. Training Random Forest...")
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)
# 7. Evaluate
print("\n6. Evaluating model...")
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\n Accuracy: {accuracy:.2%}")
print(f" (Coin-flip baseline: 50%; also compare against the up-day base rate)")
# Feature importance
print("\n7. Feature Importance:")
for feat, imp in sorted(zip(features, model.feature_importances_), key=lambda x: -x[1]):
print(f" {feat}: {imp:.3f}")
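Accuracy alone doesn't tell us whether the model makes money. A hedged sketch of the natural next step, using synthetic returns and predictions as stand-ins for the SPY test set above (`test_returns` and `predictions` here are illustrative, not the variables from the preview):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Stand-ins for the test-set daily returns and the model's 0/1 predictions
test_returns = pd.Series(rng.normal(0.0003, 0.01, 100))
predictions = pd.Series(rng.integers(0, 2, size=100))  # 1 = long, 0 = flat

# Hold the next day's return when yesterday's prediction said "up";
# the shift keeps us from trading on a return we couldn't have known yet
strategy_returns = predictions.shift(1).fillna(0) * test_returns

total = (1 + strategy_returns).prod() - 1
buy_hold = (1 + test_returns).prod() - 1
print(f"Strategy return: {total:+.2%}")
print(f"Buy-and-hold:    {buy_hold:+.2%}")
```

Module 12 covers why even this simple translation from predictions to P&L hides pitfalls (transaction costs, slippage, position sizing).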
Prerequisites Check
Before starting this course, ensure you're comfortable with:
# Prerequisite skills check
prerequisites = {
'Python Fundamentals': [
'Variables and data types',
'Functions and classes',
'List comprehensions',
'File I/O'
],
'Pandas & NumPy': [
'DataFrames and Series',
'Indexing and selection',
'GroupBy operations',
'Vectorized operations'
],
'Basic Statistics': [
'Mean, variance, standard deviation',
'Correlation and covariance',
'Normal distribution',
'Hypothesis testing basics'
],
'Financial Concepts': [
'Returns calculation',
'Risk metrics (volatility)',
'Technical indicators basics',
'Market order types'
]
}
print("Prerequisites for this course:\n")
for category, skills in prerequisites.items():
print(f"{category}:")
for skill in skills:
print(f" - {skill}")
print()
Capstone Preview
By the end of this course, you'll build a Production ML Trading System that includes:
- Multi-source data pipeline (price + sentiment + alternative)
- Feature engineering library with 50+ features
- Multiple model comparison (tree-based + neural)
- Proper walk-forward validation
- Sentiment integration from news
- Model interpretation & explainability (SHAP)
- Production deployment with monitoring
- Automated retraining pipeline
Let's Begin!
Start with Module 1: ML Concepts for Traders to understand why machine learning in finance requires special considerations.
Next: Module 1 - ML Concepts for Traders
Module 1: ML Concepts for Traders
Part 1: ML Fundamentals for Finance
| Duration | Exercises |
|---|---|
| ~2.5 hours | 6 |
Learning Objectives
By the end of this module, you will be able to:
- Distinguish between supervised and unsupervised learning
- Understand why financial ML faces unique challenges
- Map the complete ML pipeline for trading applications
- Set up your ML development environment
1.1 What is Machine Learning?
Machine learning is about building systems that learn patterns from data rather than being explicitly programmed. In trading, we use ML to find patterns that might predict future price movements.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Traditional programming vs Machine Learning
print("Traditional Programming:")
print(" Input: Data + Rules")
print(" Output: Answers")
print(" Example: IF price > SMA(20) THEN buy")
print("\nMachine Learning:")
print(" Input: Data + Answers (historical examples)")
print(" Output: Rules (learned patterns)")
print(" Example: Model learns what conditions precede profitable trades")
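A minimal sketch of this contrast: instead of hand-coding a threshold rule, we let a depth-1 decision tree recover a hidden threshold from labeled examples (the 0.6 cutoff is an arbitrary illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Hypothetical setup: labels follow a hidden rule (feature > 0.6 -> buy)
feature = rng.uniform(0, 1, 200).reshape(-1, 1)
labels = (feature.ravel() > 0.6).astype(int)

# A depth-1 tree learns a single split -- effectively rediscovering the rule
tree = DecisionTreeClassifier(max_depth=1).fit(feature, labels)
learned_threshold = tree.tree_.threshold[0]
print(f"Learned threshold: {learned_threshold:.2f}")  # close to the hidden 0.6
```

The point: we supplied data plus answers, and the model produced the rule, rather than the other way around.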
Supervised vs Unsupervised Learning
# Supervised Learning: We have labeled examples
# - Classification: Predict categories (up/down, buy/sell/hold)
# - Regression: Predict continuous values (tomorrow's return)
supervised_examples = {
'Classification': [
'Predict if stock goes up or down tomorrow',
'Classify trades as profitable or not',
'Detect market regime (bull/bear/sideways)'
],
'Regression': [
'Predict tomorrow\'s return magnitude',
'Forecast volatility',
'Estimate fair value'
]
}
# Unsupervised Learning: No labels, find structure in data
unsupervised_examples = {
'Clustering': [
'Group similar stocks together',
'Identify market regimes',
'Segment trading days by behavior'
],
'Dimensionality Reduction': [
'Reduce many correlated features to few factors',
'Find latent market factors',
'Compress feature space'
]
}
print("SUPERVISED LEARNING (with labels):")
for category, examples in supervised_examples.items():
print(f"\n {category}:")
for ex in examples:
print(f" - {ex}")
print("\n" + "="*50)
print("\nUNSUPERVISED LEARNING (no labels):")
for category, examples in unsupervised_examples.items():
print(f"\n {category}:")
for ex in examples:
print(f" - {ex}")
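To make the unsupervised side concrete, here is a small clustering sketch on synthetic data: two groups of "stocks" that differ in volatility, recovered by k-means with no labels supplied (the group parameters are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Columns: (mean daily return, volatility) for 40 synthetic stocks
low_vol = rng.normal([0.0005, 0.01], [0.0002, 0.002], size=(20, 2))
high_vol = rng.normal([0.0000, 0.05], [0.0002, 0.003], size=(20, 2))
stocks = np.vstack([low_vol, high_vol])

# No labels given: k-means groups the stocks from structure alone
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(stocks)
print("Cluster sizes:", np.bincount(km.labels_))
```

With real universes the groups are rarely this clean, but the mechanic is the same: structure emerges from the features, not from labels.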
Training, Validation, and Testing
# The fundamental concept: Split your data
print("Data Split Strategy:")
print("="*50)
print("\n1. TRAINING SET (~60-70%)")
print(" - Model learns patterns from this data")
print(" - Like studying for an exam")
print("\n2. VALIDATION SET (~15-20%)")
print(" - Used to tune hyperparameters")
print(" - Like practice tests")
print("\n3. TEST SET (~15-20%)")
print(" - Final evaluation, NEVER used during training")
print(" - Like the final exam")
# Visual representation
fig, ax = plt.subplots(figsize=(12, 2))
# Draw the splits
ax.barh(0, 0.7, left=0, color='steelblue', label='Training (70%)')
ax.barh(0, 0.15, left=0.7, color='orange', label='Validation (15%)')
ax.barh(0, 0.15, left=0.85, color='green', label='Test (15%)')
ax.set_xlim(0, 1)
ax.set_ylim(-0.5, 0.5)
ax.set_yticks([])
ax.set_xlabel('Data Timeline')
ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.5), ncol=3)
ax.set_title('Standard Data Split for Time Series')
plt.tight_layout()
plt.show()
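The fixed-percentage split above generalizes to multiple forward-chaining folds via scikit-learn's TimeSeriesSplit, which we use throughout the course. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten "days" of data; each fold trains on the past and validates
# on the block that immediately follows it (no shuffling)
X = np.arange(10).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # Every training index precedes every test index
    assert train_idx.max() < test_idx.min()
    print(f"Fold {fold}: train={list(train_idx)}, test={list(test_idx)}")
```

Module 2 covers refinements such as purging and embargo periods between the train and test blocks.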
1.2 Why Finance is Different
Financial markets present unique challenges that make ML much harder than in other domains.
# Challenge 1: Low Signal-to-Noise Ratio
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
# Simulate a signal buried in noise
days = 252
signal = np.sin(np.linspace(0, 4*np.pi, days)) * 0.01 # Tiny signal
noise = np.random.normal(0, 0.02, days) # Much larger noise
observed_returns = signal + noise
# Calculate signal-to-noise ratio
snr = np.std(signal) / np.std(noise)
print(f"Signal-to-Noise Ratio: {snr:.2%}")
print("The true signal is only about a third as strong as the noise!")
fig, axes = plt.subplots(3, 1, figsize=(12, 8))
axes[0].plot(signal, 'g-', linewidth=2)
axes[0].set_title('True Signal (Hidden)')
axes[0].set_ylabel('Return')
axes[1].plot(noise, 'r-', alpha=0.7)
axes[1].set_title('Noise')
axes[1].set_ylabel('Return')
axes[2].plot(observed_returns, 'b-', alpha=0.7)
axes[2].set_title('What We Observe (Signal + Noise)')
axes[2].set_xlabel('Days')
axes[2].set_ylabel('Return')
plt.tight_layout()
plt.show()
print("\nImplication: Models easily fit to noise, not signal")
print("Solution: Regularization, cross-validation, feature selection")
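The noise problem shows up directly when training a model. A hedged sketch with synthetic data: a Random Forest fit on features and labels that are both pure, independent noise memorizes the training set yet does no better than chance out of sample.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Features and labels are independent noise: there is nothing to learn
X = rng.normal(size=(500, 10))
y = rng.integers(0, 2, size=500)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X[:400], y[:400])

in_sample = model.score(X[:400], y[:400])   # the forest memorizes the noise
out_sample = model.score(X[400:], y[400:])  # collapses toward a coin flip
print(f"In-sample accuracy:     {in_sample:.0%}")
print(f"Out-of-sample accuracy: {out_sample:.0%}")
```

A high in-sample score on financial data should therefore trigger suspicion, not celebration.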
# Challenge 2: Non-Stationarity
# Financial time series properties change over time
np.random.seed(42)
# Simulate changing volatility regimes
low_vol = np.random.normal(0.001, 0.01, 100)
high_vol = np.random.normal(-0.002, 0.03, 100)
medium_vol = np.random.normal(0.0005, 0.015, 100)
returns = np.concatenate([low_vol, high_vol, medium_vol])
fig, axes = plt.subplots(2, 1, figsize=(12, 6))
# Returns
axes[0].plot(returns)
axes[0].axvline(100, color='r', linestyle='--', alpha=0.7)
axes[0].axvline(200, color='r', linestyle='--', alpha=0.7)
axes[0].set_title('Returns with Changing Regimes')
axes[0].set_ylabel('Return')
# Rolling volatility
rolling_vol = pd.Series(returns).rolling(20).std()
axes[1].plot(rolling_vol, color='orange')
axes[1].axvline(100, color='r', linestyle='--', alpha=0.7)
axes[1].axvline(200, color='r', linestyle='--', alpha=0.7)
axes[1].set_title('Rolling Volatility (20-day)')
axes[1].set_xlabel('Days')
axes[1].set_ylabel('Volatility')
plt.tight_layout()
plt.show()
print("The same strategy that works in one regime may fail in another.")
print("A model trained on low-vol data will struggle in high-vol periods.")
# Challenge 3: Adversarial Environment
print("The Adversarial Nature of Markets")
print("="*50)
print("""
Unlike image classification or language translation:
1. OTHER TRADERS ARE ADAPTING
- If you find a profitable pattern, others will too
- As more money exploits a pattern, the edge disappears
- "Alpha decay" - strategies lose effectiveness over time
2. THE SYSTEM FIGHTS BACK
- Market makers adjust to trading patterns
- Large trades move prices against you
- Information gets priced in faster
3. REGIME CHANGES ARE STRUCTURAL
- Regulations change market behavior
- New instruments (ETFs, derivatives) change dynamics
- Technology (HFT) changes market microstructure
This is fundamentally different from most other ML domains:
- Cats don't evolve to avoid being classified as cats
- Weather patterns don't adapt to forecasts
- Medical diagnoses don't change because you're predicting them
""")
# Challenge 4: Lookahead Bias - The Silent Killer
print("Lookahead Bias Examples")
print("="*50)
lookahead_examples = [
{
'mistake': 'Using adjusted close prices for historical signals',
'problem': 'Back-adjusted prices bake future split/dividend information into past bars',
'solution': 'Use unadjusted prices, apply adjustments at signal time'
},
{
'mistake': 'Including future data in feature calculation',
'problem': 'Rolling window includes today\'s close in today\'s signal',
'solution': 'Use .shift(1) to lag features appropriately'
},
{
'mistake': 'Survivorship bias in stock universe',
'problem': 'Only including stocks that exist today',
'solution': 'Use point-in-time constituent lists'
},
{
'mistake': 'Using final earnings numbers',
'problem': 'Earnings are often revised after initial release',
'solution': 'Use point-in-time fundamental data'
}
]
for i, example in enumerate(lookahead_examples, 1):
print(f"\n{i}. {example['mistake']}")
print(f" Problem: {example['problem']}")
print(f" Solution: {example['solution']}")
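A hedged numeric sketch of the second mistake above, deciding today's position from today's own return: on returns that are pure noise, the biased backtest looks profitable while the properly lagged one does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic daily returns with zero real predictability
returns = pd.Series(rng.normal(0, 0.01, 1000))

signal = (returns > 0).astype(int)       # "go long after an up day"
biased_pos = signal                      # uses TODAY's close -- lookahead
correct_pos = signal.shift(1).fillna(0)  # known only as of yesterday

biased_pnl = (biased_pos * returns).sum()    # spuriously large and positive
correct_pnl = (correct_pos * returns).sum()  # hovers around zero
print(f"Biased backtest P&L:  {biased_pnl:+.3f}")
print(f"Correct backtest P&L: {correct_pnl:+.3f}")
```

The biased version simply sums the positive returns, which is why it looks like free money; one `.shift(1)` removes the illusion.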
Exercise 1.1: Identify ML Problem Types (Guided)
Classify financial ML problems as supervised/unsupervised and classification/regression.
Solution 1.1
def classify_ml_problem(description: str) -> dict:
"""
Classify an ML problem based on its description.
Args:
description: Description of the ML problem
Returns:
Dictionary with learning_type and task_type
"""
description_lower = description.lower()
# Determine if supervised or unsupervised
supervised_keywords = ['predict', 'forecast', 'classify', 'whether', 'will']
is_supervised = any(kw in description_lower for kw in supervised_keywords)
# Determine task type
classification_keywords = ['up', 'down', 'category', 'direction', 'whether']
regression_keywords = ['how much', 'value', 'return', 'price', 'amount']
is_classification = any(kw in description_lower for kw in classification_keywords)
is_regression = any(kw in description_lower for kw in regression_keywords)
# Build result
result = {
'learning_type': 'supervised' if is_supervised else 'unsupervised',
'task_type': 'unknown'
}
if is_supervised:
if is_classification and not is_regression:
result['task_type'] = 'classification'
elif is_regression and not is_classification:
result['task_type'] = 'regression'
else:
result['task_type'] = 'could be either'
else:
result['task_type'] = 'clustering or dimensionality reduction'
return result
1.3 The ML Pipeline
A systematic approach to building ML models for trading.
# The Complete ML Pipeline for Trading
class MLPipelineStage:
"""Represents a stage in the ML pipeline."""
def __init__(self, name: str, description: str, key_considerations: list):
self.name = name
self.description = description
self.key_considerations = key_considerations
def display(self):
print(f"\n{'='*60}")
print(f"STAGE: {self.name}")
print(f"{'='*60}")
print(f"\n{self.description}")
print("\nKey Considerations:")
for consideration in self.key_considerations:
print(f" - {consideration}")
# Define pipeline stages
pipeline = [
MLPipelineStage(
"1. Data Collection",
"Gather all relevant data sources for your trading strategy.",
[
"Price data (OHLCV) at appropriate frequency",
"Fundamental data (earnings, ratios)",
"Alternative data (sentiment, satellite)",
"Ensure point-in-time accuracy"
]
),
MLPipelineStage(
"2. Data Preparation",
"Clean and prepare data for ML consumption.",
[
"Handle missing values appropriately",
"Adjust for corporate actions",
"Time series train/test split (no random shuffle!)",
"Check for lookahead bias"
]
),
MLPipelineStage(
"3. Feature Engineering",
"Create predictive features from raw data.",
[
"Price-based features (returns, volatility)",
"Technical indicators",
"Statistical features (z-scores, percentiles)",
"Feature selection to avoid overfitting"
]
),
MLPipelineStage(
"4. Target Engineering",
"Define what you're trying to predict.",
[
"Return-based vs direction-based targets",
"Triple barrier method for labeling",
"Handle overlapping labels",
"Sample weighting for uniqueness"
]
),
MLPipelineStage(
"5. Model Training",
"Train and tune your ML model.",
[
"Select appropriate algorithm",
"Hyperparameter tuning with CV",
"Use time series cross-validation",
"Ensemble methods for robustness"
]
),
MLPipelineStage(
"6. Evaluation",
"Assess model performance rigorously.",
[
"ML metrics (accuracy, precision, recall)",
"Financial metrics (Sharpe, returns)",
"Walk-forward validation",
"Statistical significance testing"
]
),
MLPipelineStage(
"7. Deployment",
"Put the model into production.",
[
"Real-time feature calculation",
"Model serving infrastructure",
"Monitoring and alerting",
"Retraining schedule"
]
)
]
# Display all stages
for stage in pipeline:
stage.display()
# Mini Example: Complete Pipeline Demo
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import yfinance as yf
print("Mini Pipeline Demo")
print("="*50)
# Stage 1: Data Collection
print("\n1. Collecting data...")
df = yf.Ticker("AAPL").history(period="2y")
print(f" Downloaded {len(df)} rows")
# Stage 2: Data Preparation
print("\n2. Preparing data...")
df = df[['Open', 'High', 'Low', 'Close', 'Volume']].copy()
print(f" Columns: {list(df.columns)}")
# Stage 3: Feature Engineering
print("\n3. Engineering features...")
df['returns'] = df['Close'].pct_change()
df['volatility'] = df['returns'].rolling(20).std()
df['momentum'] = df['Close'].pct_change(10)
df['volume_change'] = df['Volume'].pct_change()
print(f" Created 4 features")
# Stage 4: Target Engineering
print("\n4. Creating target...")
df['target'] = (df['returns'].shift(-1) > 0).astype(int)
print(f" Target: Next day direction (0=down, 1=up)")
# Prepare final dataset
features = ['returns', 'volatility', 'momentum', 'volume_change']
df_clean = df.dropna()
# Stage 5: Model Training (with time series split)
print("\n5. Training model...")
split_idx = int(len(df_clean) * 0.8)
X_train = df_clean[features][:split_idx]
y_train = df_clean['target'][:split_idx]
X_test = df_clean[features][split_idx:]
y_test = df_clean['target'][split_idx:]
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)
print(f" Trained on {len(X_train)} samples")
# Stage 6: Evaluation
print("\n6. Evaluating model...")
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f" Test accuracy: {accuracy:.2%}")
print(f" Baseline (random): 50%")
# Stage 7: Deployment (just a preview)
print("\n7. Ready for deployment!")
print(f" Model can predict direction based on 4 features")
Exercise 1.2: Pipeline Stage Matcher (Guided)
Match activities to the correct pipeline stage.
Solution 1.2
def match_to_pipeline_stage(activity: str) -> str:
"""
Match an activity to the correct ML pipeline stage.
Args:
activity: Description of the activity
Returns:
Name of the pipeline stage
"""
activity_lower = activity.lower()
# Define keywords for each stage
stage_keywords = {
'data_collection': ['download', 'fetch', 'api', 'source', 'gather'],
'data_preparation': ['clean', 'missing', 'split', 'adjust', 'outlier'],
'feature_engineering': ['indicator', 'feature', 'rolling', 'calculate', 'transform'],
'target_engineering': ['label', 'target', 'predict what', 'direction', 'barrier'],
'model_training': ['train', 'fit', 'hyperparameter', 'tune', 'algorithm'],
'evaluation': ['accuracy', 'sharpe', 'test', 'metric', 'performance'],
'deployment': ['production', 'real-time', 'monitor', 'serve', 'deploy']
}
# Find the matching stage
for stage, keywords in stage_keywords.items():
if any(kw in activity_lower for kw in keywords):
return stage.replace('_', ' ').title()
return 'Unknown Stage'
1.4 Tools Setup
Setting up your ML development environment for financial applications.
# Core ML Libraries
print("Essential Libraries for Financial ML")
print("="*50)
libraries = {
'Data Manipulation': {
'pandas': 'DataFrames for tabular data',
'numpy': 'Numerical computing',
},
'Machine Learning': {
'scikit-learn': 'Core ML algorithms and utilities',
'xgboost': 'Gradient boosting (fast, accurate)',
'lightgbm': 'Another gradient boosting option',
},
'Deep Learning': {
'torch (PyTorch)': 'Neural networks',
'tensorflow': 'Alternative to PyTorch',
},
'Visualization': {
'matplotlib': 'Static plots',
'seaborn': 'Statistical visualizations',
'plotly': 'Interactive plots',
},
'Financial Data': {
'yfinance': 'Yahoo Finance data',
'pandas-datareader': 'Multiple data sources',
},
'NLP & Sentiment': {
'nltk': 'Natural language processing',
'transformers': 'Pre-trained language models',
},
'Model Interpretation': {
'shap': 'SHAP values for explainability',
'lime': 'Local interpretable explanations',
}
}
for category, libs in libraries.items():
print(f"\n{category}:")
for lib, desc in libs.items():
print(f" {lib}: {desc}")
# Check your environment
def check_library(name: str) -> tuple:
"""Check if a library is installed and get its version."""
try:
module = __import__(name.replace('-', '_'))
version = getattr(module, '__version__', 'unknown')
return True, version
except ImportError:
return False, None
print("Environment Check")
print("="*50)
essential_libs = ['pandas', 'numpy', 'sklearn', 'matplotlib', 'yfinance']
optional_libs = ['xgboost', 'lightgbm', 'torch', 'shap']
print("\nEssential Libraries:")
for lib in essential_libs:
installed, version = check_library(lib)
status = f"v{version}" if installed else "NOT INSTALLED"
symbol = "OK" if installed else "MISSING"
print(f" [{symbol}] {lib}: {status}")
print("\nOptional Libraries:")
for lib in optional_libs:
installed, version = check_library(lib)
status = f"v{version}" if installed else "not installed"
symbol = "OK" if installed else "--"
print(f" [{symbol}] {lib}: {status}")
# Jupyter Notebook Best Practices for ML
print("Best Practices for ML in Jupyter")
print("="*50)
best_practices = [
("Set random seeds", "np.random.seed(42) for reproducibility"),
("Suppress warnings carefully", "warnings.filterwarnings('ignore') when appropriate"),
("Use autoreload", "%load_ext autoreload for module development"),
("Track experiments", "Log hyperparameters and results systematically"),
("Checkpoint models", "Save model state periodically"),
("Memory management", "Delete large objects with del, use gc.collect()"),
("Version control", "Strip outputs before committing notebooks")
]
for practice, explanation in best_practices:
print(f"\n{practice}:")
print(f" {explanation}")
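The first practice above in miniature: a quick sketch showing why seeding makes experiments reproducible.

```python
import numpy as np

# Same seed -> identical "random" draws -> rerunnable experiments
np.random.seed(42)
run_a = np.random.rand(3)
np.random.seed(42)
run_b = np.random.rand(3)
print("Identical draws:", np.array_equal(run_a, run_b))  # prints True
```

Seed every source of randomness (NumPy, the model's random_state, any DL framework) or results will drift between notebook runs.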
Exercise 1.3: Environment Validator (Guided)
Create a function that validates the ML environment is ready.
Solution 1.3
from typing import Dict, List, Optional

def validate_ml_environment(required_libs: List[str], optional_libs: Optional[List[str]] = None) -> Dict:
"""
Validate that required ML libraries are installed.
Args:
required_libs: List of required library names
optional_libs: List of optional library names
Returns:
Dictionary with validation results
"""
if optional_libs is None:
optional_libs = []
results = {
'required': {},
'optional': {},
'ready': True,
'missing_required': []
}
# Check required libraries
for lib in required_libs:
try:
module = __import__(lib.replace('-', '_'))
version = getattr(module, '__version__', 'installed')
results['required'][lib] = {'installed': True, 'version': version}
except ImportError:
results['required'][lib] = {'installed': False, 'version': None}
results['ready'] = False
results['missing_required'].append(lib)
# Check optional libraries
for lib in optional_libs:
try:
module = __import__(lib.replace('-', '_'))
version = getattr(module, '__version__', 'installed')
results['optional'][lib] = {'installed': True, 'version': version}
except ImportError:
results['optional'][lib] = {'installed': False, 'version': None}
return results
Open-Ended Exercises
Exercise 1.4: Financial ML Challenges Analysis (Open-ended)
Create a comprehensive analysis of why a specific ML model might fail in financial markets.
Solution 1.4
def analyze_ml_challenges(df: pd.DataFrame, return_col: str = 'returns') -> dict:
"""
Analyze a dataset for potential ML challenges.
Args:
df: DataFrame with financial data
return_col: Name of the returns column
Returns:
Dictionary with challenge analysis
"""
report = {
'signal_to_noise': {},
'regime_stability': {},
'potential_lookahead': [],
'recommendations': []
}
# Signal-to-noise analysis
if return_col in df.columns:
returns = df[return_col].dropna()
mean_return = returns.mean()
std_return = returns.std()
snr = abs(mean_return) / std_return if std_return > 0 else 0
report['signal_to_noise'] = {
'mean_return': mean_return,
'std_return': std_return,
'ratio': snr,
'assessment': 'Low' if snr < 0.1 else 'Moderate' if snr < 0.2 else 'Good'
}
if snr < 0.1:
report['recommendations'].append(
"Very low signal-to-noise. Consider strong regularization."
)
# Regime stability
mid_point = len(df) // 2
first_half = df[return_col][:mid_point].dropna() if return_col in df.columns else None
second_half = df[return_col][mid_point:].dropna() if return_col in df.columns else None
if first_half is not None and len(first_half) > 1 and len(second_half) > 1 and first_half.std() > 0:
vol_ratio = second_half.std() / first_half.std()
report['regime_stability'] = {
'first_half_vol': first_half.std(),
'second_half_vol': second_half.std(),
'volatility_ratio': vol_ratio,
'stable': 0.5 < vol_ratio < 2.0
}
if not report['regime_stability']['stable']:
report['recommendations'].append(
"Significant regime change detected. Consider regime-aware models."
)
# Lookahead bias check
suspicious_keywords = ['future', 'next', 'forward', 'target', 'label']
for col in df.columns:
if any(kw in col.lower() for kw in suspicious_keywords):
report['potential_lookahead'].append(col)
if report['potential_lookahead']:
report['recommendations'].append(
f"Columns with potential lookahead: {report['potential_lookahead']}. Verify timing."
)
return report
# Test
import yfinance as yf
df = yf.Ticker("SPY").history(period="2y")
df['returns'] = df['Close'].pct_change()
df['future_return'] = df['returns'].shift(-1) # Intentional lookahead
analysis = analyze_ml_challenges(df, 'returns')
print("ML Challenges Analysis")
print("="*50)
print(f"\nSignal-to-Noise: {analysis['signal_to_noise']['assessment']}")
print(f" Ratio: {analysis['signal_to_noise']['ratio']:.4f}")
print(f"\nRegime Stability: {'Stable' if analysis['regime_stability'].get('stable') else 'Unstable'}")
vol_ratio = analysis['regime_stability'].get('volatility_ratio')
print(f" Vol Ratio: {vol_ratio:.2f}" if vol_ratio is not None else " Vol Ratio: N/A")
print(f"\nPotential Lookahead Issues: {analysis['potential_lookahead']}")
print(f"\nRecommendations:")
for rec in analysis['recommendations']:
print(f" - {rec}")
Exercise 1.5: ML Pipeline Builder (Open-ended)
Create a class that represents and validates an ML pipeline configuration.
Solution 1.5
class MLPipelineConfig:
"""
Configuration manager for ML pipelines.
"""
REQUIRED_STAGES = ['data', 'features', 'target', 'model', 'evaluation']
def __init__(self):
self.config = {}
self.validation_errors = []
def add_stage(self, stage_name: str, config: dict) -> 'MLPipelineConfig':
"""
Add a stage configuration.
Args:
stage_name: Name of the pipeline stage
config: Configuration dictionary for the stage
Returns:
Self for method chaining
"""
if 'method' not in config:
self.validation_errors.append(
f"Stage '{stage_name}' missing required 'method' key"
)
self.config[stage_name] = config
return self
def validate(self) -> bool:
"""
Validate the pipeline configuration.
Returns:
True if valid, False otherwise
"""
self.validation_errors = []
# Check required stages
for stage in self.REQUIRED_STAGES:
if stage not in self.config:
self.validation_errors.append(f"Missing required stage: {stage}")
# Check for common errors
if 'data' in self.config:
data_config = self.config['data']
if data_config.get('test_size', 0) > 0.5:
self.validation_errors.append(
"Warning: Test size > 50% may leave insufficient training data"
)
if 'model' in self.config:
model_config = self.config['model']
if model_config.get('cv_method') == 'random':
self.validation_errors.append(
"Warning: Random CV is inappropriate for time series. Use TimeSeriesSplit."
)
return len(self.validation_errors) == 0
def get_summary(self) -> str:
"""
Generate a summary report of the pipeline.
Returns:
Summary string
"""
lines = ["ML Pipeline Configuration Summary", "=" * 40]
for stage, config in self.config.items():
lines.append(f"\n{stage.upper()}:")
for key, value in config.items():
lines.append(f" {key}: {value}")
is_valid = self.validate()
lines.append(f"\nValidation: {'PASSED' if is_valid else 'FAILED'}")
if self.validation_errors:
lines.append("\nIssues:")
for error in self.validation_errors:
lines.append(f" - {error}")
return "\n".join(lines)
# Test the pipeline builder
pipeline = MLPipelineConfig()
pipeline.add_stage('data', {
'method': 'yfinance',
'symbols': ['SPY', 'AAPL'],
'period': '2y',
'test_size': 0.2
})
pipeline.add_stage('features', {
'method': 'technical_indicators',
'indicators': ['sma', 'rsi', 'macd']
})
pipeline.add_stage('target', {
'method': 'direction',
'horizon': 1
})
pipeline.add_stage('model', {
'method': 'random_forest',
'cv_method': 'time_series',
'n_splits': 5
})
pipeline.add_stage('evaluation', {
'method': 'classification_metrics',
'metrics': ['accuracy', 'f1', 'sharpe']
})
print(pipeline.get_summary())
Exercise 1.6: Complete ML Workflow Skeleton (Open-ended)
Create a complete but minimal ML workflow for trading that demonstrates all pipeline stages.
Solution 1.6
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
import yfinance as yf
class TradingMLWorkflow:
"""
Complete ML workflow for trading applications.
"""
def __init__(self, symbol: str, period: str = '2y'):
self.symbol = symbol
self.period = period
self.data = None
self.model = None
self.results = None
def fetch_data(self) -> pd.DataFrame:
"""Stage 1: Data Collection"""
ticker = yf.Ticker(self.symbol)
self.data = ticker.history(period=self.period)
return self.data
def engineer_features(self) -> pd.DataFrame:
"""Stage 2 & 3: Data Preparation & Feature Engineering"""
df = self.data.copy()
# Basic features
df['returns'] = df['Close'].pct_change()
df['volatility'] = df['returns'].rolling(20).std()
df['momentum_5'] = df['Close'].pct_change(5)
df['momentum_20'] = df['Close'].pct_change(20)
df['volume_change'] = df['Volume'].pct_change()
# Distance from moving averages
df['sma_20'] = df['Close'].rolling(20).mean()
df['dist_sma'] = (df['Close'] - df['sma_20']) / df['sma_20']
self.data = df
return df
def create_target(self, horizon: int = 1) -> pd.DataFrame:
"""Stage 4: Target Engineering"""
df = self.data.copy()
df['target'] = (df['returns'].shift(-horizon) > 0).astype(int)
self.data = df
return df
def train_model(self, test_size: float = 0.2) -> dict:
"""Stage 5 & 6: Model Training & Evaluation"""
# Prepare data
feature_cols = ['returns', 'volatility', 'momentum_5', 'momentum_20',
'volume_change', 'dist_sma']
df_clean = self.data.dropna()
X = df_clean[feature_cols]
y = df_clean['target']
# Time series split
split_idx = int(len(X) * (1 - test_size))
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
# Train
self.model = RandomForestClassifier(
n_estimators=100,
max_depth=5,
random_state=42
)
self.model.fit(X_train, y_train)
# Predict
y_pred = self.model.predict(X_test)
y_prob = self.model.predict_proba(X_test)[:, 1]
# Calculate metrics
self.results = {
'accuracy': accuracy_score(y_test, y_pred),
'precision': precision_score(y_test, y_pred),
'recall': recall_score(y_test, y_pred),
'train_size': len(X_train),
'test_size': len(X_test),
'feature_importance': dict(zip(feature_cols, self.model.feature_importances_)),
'predictions': pd.DataFrame({
'actual': y_test.values,
'predicted': y_pred,
'probability': y_prob
}, index=y_test.index)
}
return self.results
def run_full_pipeline(self) -> dict:
"""Execute the complete pipeline."""
print(f"Running ML Pipeline for {self.symbol}")
print("=" * 50)
print("\n1. Fetching data...")
self.fetch_data()
print(f" Downloaded {len(self.data)} rows")
print("\n2. Engineering features...")
self.engineer_features()
print(f" Created 6 features")
print("\n3. Creating target...")
self.create_target()
print(f" Target: next-day direction")
print("\n4. Training and evaluating...")
results = self.train_model()
print(f"\n" + "=" * 50)
print("RESULTS:")
print(f" Accuracy: {results['accuracy']:.2%}")
print(f" Precision: {results['precision']:.2%}")
print(f" Recall: {results['recall']:.2%}")
print(f"\nTop Features:")
sorted_features = sorted(
results['feature_importance'].items(),
key=lambda x: -x[1]
)
for feat, imp in sorted_features[:3]:
print(f" {feat}: {imp:.3f}")
return results
# Run the workflow
workflow = TradingMLWorkflow('SPY', '2y')
results = workflow.run_full_pipeline()
Module Project: ML Project Template
Create a reusable project template that sets up the structure for any financial ML project.
# Module Project: ML Project Template
import os
from datetime import datetime
from typing import Dict, List, Optional
import json
class MLProjectTemplate:
"""
Creates a standardized project structure for financial ML projects.
This template ensures consistent organization and includes
all necessary components for reproducible ML experiments.
"""
def __init__(self, project_name: str, description: str = ""):
"""
Initialize a new ML project template.
Args:
project_name: Name of the project
description: Brief description of the project
"""
self.project_name = project_name
self.description = description
self.created_at = datetime.now().isoformat()
self.config = self._default_config()
self.directory_structure = self._default_structure()
def _default_config(self) -> Dict:
"""Return default project configuration."""
return {
'data': {
'source': 'yfinance',
'symbols': [],
'period': '2y',
'test_size': 0.2,
'validation_size': 0.1
},
'features': {
'price_based': ['returns', 'volatility', 'momentum'],
'technical': ['sma', 'rsi', 'macd'],
'scaling': 'standard'
},
'target': {
'type': 'classification',
'method': 'direction',
'horizon': 1
},
'model': {
'algorithm': 'random_forest',
'cv_method': 'time_series',
'n_splits': 5,
'hyperparameters': {}
},
'evaluation': {
'ml_metrics': ['accuracy', 'precision', 'recall', 'f1'],
'financial_metrics': ['sharpe', 'returns', 'max_drawdown']
},
'random_seed': 42
}
def _default_structure(self) -> Dict:
"""Return default directory structure."""
return {
'data': {
'raw': 'Original data files',
'processed': 'Cleaned and transformed data',
'features': 'Engineered features'
},
'notebooks': {
'01_data_exploration.ipynb': 'EDA notebook',
'02_feature_engineering.ipynb': 'Feature creation',
'03_model_training.ipynb': 'Model development',
'04_evaluation.ipynb': 'Results analysis'
},
'src': {
'data': 'Data loading and processing modules',
'features': 'Feature engineering code',
'models': 'Model definitions',
'evaluation': 'Evaluation utilities'
},
'models': 'Saved model files',
'reports': 'Generated reports and visualizations',
'config': 'Configuration files'
}
def set_symbols(self, symbols: List[str]) -> 'MLProjectTemplate':
"""Set the trading symbols for the project."""
self.config['data']['symbols'] = symbols
return self
def set_model(self, algorithm: str, **kwargs) -> 'MLProjectTemplate':
"""Set the model algorithm and hyperparameters."""
self.config['model']['algorithm'] = algorithm
self.config['model']['hyperparameters'] = kwargs
return self
def set_target(self, target_type: str, method: str, horizon: int = 1) -> 'MLProjectTemplate':
"""Set the prediction target configuration."""
self.config['target'] = {
'type': target_type,
'method': method,
'horizon': horizon
}
return self
def validate_config(self) -> Dict:
"""
Validate the project configuration.
Returns:
Dictionary with validation results and warnings
"""
results = {
'valid': True,
'errors': [],
'warnings': []
}
# Check required fields
if not self.config['data']['symbols']:
results['warnings'].append("No symbols specified")
# Check for common mistakes
test_size = self.config['data']['test_size']
val_size = self.config['data']['validation_size']
if test_size + val_size > 0.5:
results['warnings'].append(
f"Test + validation = {test_size + val_size:.0%}, leaving only "
f"{1 - test_size - val_size:.0%} for training"
)
# Check model configuration
if self.config['model']['cv_method'] == 'random':
results['errors'].append(
"Random CV is inappropriate for time series data"
)
results['valid'] = False
return results
def generate_readme(self) -> str:
"""Generate a README for the project."""
readme = f"""
# {self.project_name}
{self.description}
Created: {self.created_at}
## Configuration
### Data
- Source: {self.config['data']['source']}
- Symbols: {', '.join(self.config['data']['symbols']) or 'Not specified'}
- Period: {self.config['data']['period']}
### Model
- Algorithm: {self.config['model']['algorithm']}
- CV Method: {self.config['model']['cv_method']}
### Target
- Type: {self.config['target']['type']}
- Method: {self.config['target']['method']}
- Horizon: {self.config['target']['horizon']} day(s)
## Directory Structure
```
{self.project_name}/
├── data/
│ ├── raw/
│ ├── processed/
│ └── features/
├── notebooks/
├── src/
│ ├── data/
│ ├── features/
│ ├── models/
│ └── evaluation/
├── models/
├── reports/
└── config/
```
## Usage
1. Start with `notebooks/01_data_exploration.ipynb`
2. Engineer features in `notebooks/02_feature_engineering.ipynb`
3. Train models in `notebooks/03_model_training.ipynb`
4. Evaluate results in `notebooks/04_evaluation.ipynb`
"""
return readme.strip()
def get_summary(self) -> str:
"""Get a summary of the project template."""
validation = self.validate_config()
summary = f"""
{'='*60}
ML PROJECT TEMPLATE: {self.project_name}
{'='*60}
Description: {self.description or 'Not provided'}
Created: {self.created_at}
CONFIGURATION:
Data Source: {self.config['data']['source']}
Symbols: {self.config['data']['symbols'] or 'Not set'}
Model: {self.config['model']['algorithm']}
Target: {self.config['target']['method']} ({self.config['target']['type']})
VALIDATION: {'PASSED' if validation['valid'] else 'FAILED'}
"""
if validation['errors']:
summary += "\nERRORS:\n"
for error in validation['errors']:
summary += f" - {error}\n"
if validation['warnings']:
summary += "\nWARNINGS:\n"
for warning in validation['warnings']:
summary += f" - {warning}\n"
return summary
# Demo the project template
print("Creating ML Project Template...\n")
project = MLProjectTemplate(
project_name="SPY Direction Predictor",
description="Predict next-day direction of SPY using technical features"
)
# Configure the project
project.set_symbols(['SPY'])
project.set_model(
'random_forest',
n_estimators=100,
max_depth=5
)
project.set_target('classification', 'direction', horizon=1)
# Display summary
print(project.get_summary())
# Show README preview
print("\n" + "="*60)
print("README PREVIEW:")
print("="*60)
print(project.generate_readme()[:1000] + "...")
Key Takeaways
- ML Types: Supervised learning (classification/regression) predicts with labels; unsupervised finds patterns without labels
- Finance is Different: Low signal-to-noise ratio, non-stationarity, regime changes, and an adversarial environment make financial ML uniquely challenging
- The Pipeline: Data → Features → Target → Model → Evaluation → Deployment; each stage requires finance-specific considerations
- Lookahead Bias: The most common and deadly mistake; always verify your features don't include future information
- Time Series Split: Never randomly shuffle financial data; always maintain temporal order in train/test splits
- Tool Ecosystem: scikit-learn, xgboost, and pandas form the core; add specialized tools as needed
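To make the lookahead-bias takeaway concrete, here is a minimal sketch on a toy price series: the target legitimately uses `shift(-1)` (it is the thing we predict), while a feature built the same way silently imports tomorrow's return into today's row.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=8, freq="B")
close = pd.Series([100, 101, 99, 102, 104, 103, 105, 107.0], index=idx)

df = pd.DataFrame({"Close": close})
df["returns"] = df["Close"].pct_change()                  # past -> present: safe feature
df["target"] = (df["returns"].shift(-1) > 0).astype(int)  # future info: allowed ONLY as the label

# Leaky version (WRONG): a "feature" built from the same future return as the target
df["leaky_feature"] = df["returns"].shift(-1)

# The leaky feature at time t is exactly the realized return at t+1 -- future data
print("leaky_feature[t] == returns[t+1]:",
      df["leaky_feature"].iloc[0] == df["returns"].iloc[1])
```

A useful habit: for every feature column, ask whether its value at timestamp `t` could have been computed at `t` with only data up to `t`.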
Next: Module 2 - Data Preparation
Learn how to properly prepare financial data for ML, including handling missing values, train-test splits for time series, and avoiding common data leakage pitfalls.
Module 2: Data Preparation
Part 1: ML Fundamentals for Finance
| Duration | Exercises |
|---|---|
| ~2.5 hours | 6 |
Learning Objectives
By the end of this module, you will be able to:
- Clean financial data while preserving signal integrity
- Implement proper train-test splits for time series
- Use time series cross-validation techniques
- Handle imbalanced class distributions in trading data
2.1 Financial Data Cleaning
Financial data has unique cleaning requirements that differ from general data science.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf
import warnings
warnings.filterwarnings('ignore')
# Download sample data
print("Downloading sample data...")
df = yf.Ticker("AAPL").history(period="2y")
print(f"Downloaded {len(df)} rows")
print(f"\nColumns: {list(df.columns)}")
print(f"\nDate range: {df.index[0].date()} to {df.index[-1].date()}")
# Check for missing data
def analyze_missing_data(df: pd.DataFrame) -> pd.DataFrame:
"""
Analyze missing data patterns in a DataFrame.
Args:
df: Input DataFrame
Returns:
DataFrame with missing data statistics
"""
missing_stats = pd.DataFrame({
'missing_count': df.isnull().sum(),
'missing_pct': df.isnull().sum() / len(df) * 100,
'dtype': df.dtypes
})
return missing_stats.sort_values('missing_pct', ascending=False)
print("Missing Data Analysis:")
print(analyze_missing_data(df))
# Handling missing values in financial data
class FinancialDataCleaner:
"""
Clean financial time series data with appropriate methods.
"""
def __init__(self, df: pd.DataFrame):
self.df = df.copy()
self.cleaning_log = []
def handle_missing_prices(self, method: str = 'ffill') -> 'FinancialDataCleaner':
"""
Handle missing price data.
For OHLC data, forward fill is usually appropriate as it
represents "last known price" - what you'd actually trade at.
"""
price_cols = ['Open', 'High', 'Low', 'Close']
price_cols = [c for c in price_cols if c in self.df.columns]
before_missing = self.df[price_cols].isnull().sum().sum()
if method == 'ffill':
self.df[price_cols] = self.df[price_cols].ffill()
elif method == 'interpolate':
self.df[price_cols] = self.df[price_cols].interpolate(method='time')
after_missing = self.df[price_cols].isnull().sum().sum()
self.cleaning_log.append(
f"Filled {before_missing - after_missing} missing price values using {method}"
)
return self
def handle_missing_volume(self, fill_value: float = 0) -> 'FinancialDataCleaner':
"""
Handle missing volume data.
Missing volume often means no trades occurred, so filling with 0 is reasonable.
"""
if 'Volume' in self.df.columns:
before_missing = self.df['Volume'].isnull().sum()
self.df['Volume'] = self.df['Volume'].fillna(fill_value)
self.cleaning_log.append(
f"Filled {before_missing} missing volume values with {fill_value}"
)
return self
def detect_outliers(self, column: str, n_std: float = 5) -> pd.Series:
"""
Detect outliers using standard deviation method.
For financial data, 5 std is a reasonable threshold as
returns can legitimately be extreme.
"""
if column not in self.df.columns:
return pd.Series(dtype=bool)
mean = self.df[column].mean()
std = self.df[column].std()
lower_bound = mean - n_std * std
upper_bound = mean + n_std * std
is_outlier = (self.df[column] < lower_bound) | (self.df[column] > upper_bound)
self.cleaning_log.append(
f"Found {is_outlier.sum()} outliers in {column} (>{n_std} std)"
)
return is_outlier
def remove_zero_volume_days(self) -> 'FinancialDataCleaner':
"""
Remove days with zero volume (likely data errors or holidays).
"""
if 'Volume' in self.df.columns:
before_len = len(self.df)
self.df = self.df[self.df['Volume'] > 0]
removed = before_len - len(self.df)
self.cleaning_log.append(f"Removed {removed} zero-volume days")
return self
def get_cleaned_data(self) -> pd.DataFrame:
"""Return the cleaned DataFrame."""
return self.df
def get_cleaning_report(self) -> str:
"""Return a report of all cleaning operations."""
report = "Data Cleaning Report\n" + "="*40 + "\n"
for entry in self.cleaning_log:
report += f"- {entry}\n"
return report
# Apply cleaning
cleaner = FinancialDataCleaner(df)
cleaner.handle_missing_prices('ffill')
cleaner.handle_missing_volume(0)
cleaner.remove_zero_volume_days()
# Add returns and check for outliers
df_clean = cleaner.get_cleaned_data()
df_clean['returns'] = df_clean['Close'].pct_change()
cleaner.df = df_clean
outliers = cleaner.detect_outliers('returns', n_std=5)
print(cleaner.get_cleaning_report())
print(f"\nFinal dataset: {len(df_clean)} rows")
Exercise 2.1: Data Quality Checker (Guided)
Build a function to assess data quality for ML readiness.
Solution 2.1
def check_data_quality(df: pd.DataFrame, price_col: str = 'Close') -> dict:
"""
Check data quality for ML readiness.
"""
quality = {
'row_count': len(df),
'date_range': None,
'missing_data': {},
'issues': [],
'ready_for_ml': True
}
# Calculate date range
if hasattr(df.index, 'min') and hasattr(df.index, 'max'):
quality['date_range'] = {
'start': str(df.index.min()),
'end': str(df.index.max())
}
# Calculate missing data percentage for each column
for col in df.columns:
missing_pct = df[col].isnull().sum() / len(df) * 100
quality['missing_data'][col] = round(missing_pct, 2)
if missing_pct > 5:
quality['issues'].append(f"{col} has {missing_pct:.1f}% missing data")
quality['ready_for_ml'] = False
# Check for minimum data requirements
if len(df) < 252:
quality['issues'].append("Less than 252 rows (1 year of trading days)")
quality['ready_for_ml'] = False
return quality
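A self-contained sanity run along the same lines, so the thresholds can be exercised without a download: synthetic business-day closes with three injected NaNs, checked against the same 5% missing-data and 252-row limits used in the solution above.

```python
import numpy as np
import pandas as pd

# Synthetic daily closes with a few injected gaps
idx = pd.date_range("2022-01-03", periods=300, freq="B")
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(idx)))), index=idx)
close.iloc[[10, 50, 51]] = np.nan  # three missing values out of 300 rows

df_demo = pd.DataFrame({"Close": close})
missing_pct = df_demo["Close"].isnull().mean() * 100
issues = []
if missing_pct > 5:
    issues.append(f"Close has {missing_pct:.1f}% missing data")
if len(df_demo) < 252:
    issues.append("Less than 252 rows (1 year of trading days)")
print(f"rows={len(df_demo)}, missing={missing_pct:.1f}%, ready_for_ml={not issues}")
```

With 3 gaps in 300 rows (1.0% missing) and more than a year of data, this frame passes both checks.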
2.2 Train-Test Split for Time Series
Critical: Never randomly shuffle time series data. Always maintain temporal order.
# Why random split fails for time series
print("Random Split vs Time Series Split")
print("="*50)
print("""
RANDOM SPLIT (WRONG for time series):
┌─────────────────────────────────────────┐
│ Train Train Test Train Test Train Test │
│ ↑ ↑ ↑ ↑ ↑ ↑ ↑ │
│ Randomly scattered across time │
└─────────────────────────────────────────┘
Problem: Model can "peek" at future data during training!
TIME SERIES SPLIT (CORRECT):
┌─────────────────────────────────────────┐
│ Train Train Train Train │ Test Test Test│
│ ←─── Earlier ──→ │ ←─ Later ──→ │
└─────────────────────────────────────────┘
Correct: Model only trained on past, tested on future.
""")
# Proper time series split implementation
from typing import Tuple
def time_series_split(
df: pd.DataFrame,
test_size: float = 0.2,
validation_size: float = 0.1
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
"""
Split time series data maintaining temporal order.
Args:
df: DataFrame sorted by date
test_size: Proportion for test set (most recent data)
validation_size: Proportion for validation set
Returns:
Tuple of (train, validation, test) DataFrames
"""
n = len(df)
# Calculate split indices
test_start = int(n * (1 - test_size))
val_start = int(n * (1 - test_size - validation_size))
# Split
train = df.iloc[:val_start]
validation = df.iloc[val_start:test_start]
test = df.iloc[test_start:]
return train, validation, test
# Apply split
train, val, test = time_series_split(df_clean, test_size=0.2, validation_size=0.1)
print("Time Series Split Results:")
print("="*50)
print(f"\nTraining: {len(train):4d} rows ({len(train)/len(df_clean)*100:.1f}%)")
print(f" Period: {train.index[0].date()} to {train.index[-1].date()}")
print(f"\nValidation: {len(val):4d} rows ({len(val)/len(df_clean)*100:.1f}%)")
print(f" Period: {val.index[0].date()} to {val.index[-1].date()}")
print(f"\nTest: {len(test):4d} rows ({len(test)/len(df_clean)*100:.1f}%)")
print(f" Period: {test.index[0].date()} to {test.index[-1].date()}")
# Visualize the split
fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(train.index, train['Close'], 'b-', label='Training', linewidth=1)
ax.plot(val.index, val['Close'], 'orange', label='Validation', linewidth=1)
ax.plot(test.index, test['Close'], 'g-', label='Test', linewidth=1)
# Add vertical lines at split points
ax.axvline(val.index[0], color='gray', linestyle='--', alpha=0.7)
ax.axvline(test.index[0], color='gray', linestyle='--', alpha=0.7)
ax.set_title('Time Series Train/Validation/Test Split')
ax.set_xlabel('Date')
ax.set_ylabel('Price')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Purging and Embargo
When features or labels overlap in time, we need additional safeguards.
# Purging and Embargo explained
print("Purging and Embargo")
print("="*50)
print("""
PROBLEM: Labels often span multiple days (e.g., 5-day returns)
Without purging:
┌────────────────────────────────────────────────────┐
│ Day 1 Day 2 Day 3 │ Day 4 Day 5 Day 6 │
│ ←─── Training ────→ │ ←──── Test ────→ │
│ └───────────────┘ │
│ Label for Day 3 includes Day 4 & 5! │
└────────────────────────────────────────────────────┘
LEAKAGE: Training label contains test period info
With purging (remove overlap):
┌────────────────────────────────────────────────────┐
│ Day 1 Day 2 │ PURGED │ Day 4 Day 5 Day 6 │
│ ←─ Training ─→│ │ ←──── Test ────→ │
└────────────────────────────────────────────────────┘
SAFE: Gap prevents information leakage
Embargo adds an extra buffer between the purged region and the test period:
┌────────────────────────────────────────────────────┐
│ Day 1 │ PURGED │ EMBARGO │ Day 5 Day 6 Day 7 │
│ Train │ │ │ ←──── Test ────→ │
└────────────────────────────────────────────────────┘
""")
def apply_purge_embargo(
train_idx: pd.DatetimeIndex,
test_idx: pd.DatetimeIndex,
label_horizon: int = 5,
embargo_days: int = 1
) -> pd.DatetimeIndex:
"""
Remove training samples that overlap with test period.
Args:
train_idx: Training data index
test_idx: Test data index
label_horizon: How many days the label spans
embargo_days: Additional buffer days
Returns:
Purged training index
"""
test_start = test_idx.min()
    # Purge: remove samples whose labels would overlap with the test period.
    # Note: pd.Timedelta counts calendar days, so a horizon measured in
    # trading days (5 trading days span ~7 calendar days) should be padded.
    purge_cutoff = test_start - pd.Timedelta(days=label_horizon)
# Embargo: Additional buffer
embargo_cutoff = purge_cutoff - pd.Timedelta(days=embargo_days)
purged_idx = train_idx[train_idx < embargo_cutoff]
return purged_idx
# Example
purged_train_idx = apply_purge_embargo(
train.index,
test.index,
label_horizon=5,
embargo_days=1
)
print(f"\nOriginal training samples: {len(train)}")
print(f"After purge + embargo: {len(purged_train_idx)}")
print(f"Removed: {len(train) - len(purged_train_idx)} samples")
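The purge arithmetic can also be checked without market data. This standalone sketch restates the function and runs it on a synthetic daily index, where the expected removal is easy to count by hand: a 5-day horizon plus a 1-day embargo should strip 6 days from the end of the training window.

```python
import pandas as pd

def apply_purge_embargo(train_idx, test_idx, label_horizon=5, embargo_days=1):
    # Restated from the module so the snippet runs on its own
    test_start = test_idx.min()
    purge_cutoff = test_start - pd.Timedelta(days=label_horizon)
    embargo_cutoff = purge_cutoff - pd.Timedelta(days=embargo_days)
    return train_idx[train_idx < embargo_cutoff]

dates = pd.date_range("2024-01-01", periods=30, freq="D")
train_idx, test_idx = dates[:20], dates[20:]  # test starts Jan 21

purged = apply_purge_embargo(train_idx, test_idx, label_horizon=5, embargo_days=1)
print(f"train {len(train_idx)} -> purged {len(purged)}")  # Jan 15-20 removed
```

The cutoff lands on Jan 15 (Jan 21 minus 5 purge days minus 1 embargo day), leaving Jan 1 through Jan 14 in the training set.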
Exercise 2.2: Time Series Splitter (Guided)
Build a comprehensive time series splitter class.
Solution 2.2
class TimeSeriesSplitter:
"""
Handles time series data splitting with purging and embargo.
"""
def __init__(self, test_size: float = 0.2, purge_days: int = 0, embargo_days: int = 0):
self.test_size = test_size
self.purge_days = purge_days
self.embargo_days = embargo_days
def split(self, df: pd.DataFrame) -> dict:
"""
Split data into train and test sets.
"""
n = len(df)
# Calculate split index
split_idx = int(n * (1 - self.test_size))
# Initial split
train = df.iloc[:split_idx].copy()
test = df.iloc[split_idx:].copy()
# Apply purge and embargo
if self.purge_days > 0 or self.embargo_days > 0:
total_gap = self.purge_days + self.embargo_days
train = train.iloc[:-total_gap] if total_gap > 0 else train
return {
'train': train,
'test': test,
'train_size': len(train),
'test_size': len(test),
'split_date': df.index[split_idx]
}
2.3 Cross-Validation for Finance
Standard k-fold cross-validation doesn't work for time series.
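One way to see why: compare the first fold that sklearn's shuffled `KFold` and `TimeSeriesSplit` each produce on the same data. The shuffled split scatters test samples across the timeline, while the time series split keeps every test index strictly after every training index.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X_demo = np.arange(20).reshape(-1, 1)

# Shuffled KFold: test indices land anywhere in the sequence
kf_train, kf_test = next(KFold(n_splits=4, shuffle=True, random_state=0).split(X_demo))

# TimeSeriesSplit: training always precedes testing in time
ts_train, ts_test = next(TimeSeriesSplit(n_splits=4).split(X_demo))

print("KFold test indices:     ", sorted(kf_test))
print("TimeSeriesSplit test:   ", ts_test)
assert ts_test.min() > ts_train.max()  # temporal order preserved
```

With a shuffled split, the model is trained on observations that occur after some of the test observations, which is exactly the "peeking at the future" failure described above.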
# Time Series Cross-Validation with sklearn
from sklearn.model_selection import TimeSeriesSplit
# Create sample feature matrix
df_features = df_clean.copy()
df_features['returns'] = df_features['Close'].pct_change()
df_features['volatility'] = df_features['returns'].rolling(20).std()
df_features['momentum'] = df_features['Close'].pct_change(10)
df_features['target'] = (df_features['returns'].shift(-1) > 0).astype(int)
df_features = df_features.dropna()
X = df_features[['returns', 'volatility', 'momentum']]
y = df_features['target']
# Time series cross-validation
tscv = TimeSeriesSplit(n_splits=5)
print("Time Series Cross-Validation Folds:")
print("="*60)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
train_start = X.index[train_idx[0]].date()
train_end = X.index[train_idx[-1]].date()
test_start = X.index[test_idx[0]].date()
test_end = X.index[test_idx[-1]].date()
print(f"\nFold {fold}:")
print(f" Train: {train_start} to {train_end} ({len(train_idx)} samples)")
print(f" Test: {test_start} to {test_end} ({len(test_idx)} samples)")
# Visualize the CV folds
fig, axes = plt.subplots(5, 1, figsize=(14, 8), sharex=True)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
ax = axes[fold]
# Plot training period
train_dates = X.index[train_idx]
test_dates = X.index[test_idx]
ax.fill_between(train_dates, 0, 1, alpha=0.3, color='blue', label='Train')
ax.fill_between(test_dates, 0, 1, alpha=0.3, color='green', label='Test')
ax.set_ylabel(f'Fold {fold+1}')
ax.set_yticks([])
ax.set_xlim(X.index[0], X.index[-1])
if fold == 0:
ax.legend(loc='upper left')
axes[-1].set_xlabel('Date')
fig.suptitle('Time Series Cross-Validation Folds', fontsize=12)
plt.tight_layout()
plt.show()
# Run cross-validation with a model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
tscv = TimeSeriesSplit(n_splits=5)
cv_results = []
print("Cross-Validation Results:")
print("="*50)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cv_results.append({'fold': fold, 'accuracy': acc, 'f1': f1})
print(f"Fold {fold}: Accuracy={acc:.3f}, F1={f1:.3f}")
# Summary
results_df = pd.DataFrame(cv_results)
print(f"\nMean Accuracy: {results_df['accuracy'].mean():.3f} (+/- {results_df['accuracy'].std():.3f})")
print(f"Mean F1: {results_df['f1'].mean():.3f} (+/- {results_df['f1'].std():.3f})")
Exercise 2.3: Custom CV Generator (Guided)
Build a custom cross-validation generator with gap support.
Solution 2.3
from typing import Generator, Tuple

def time_series_cv_with_gap(
n_samples: int,
n_splits: int = 5,
gap: int = 0
) -> Generator[Tuple[np.ndarray, np.ndarray], None, None]:
"""
Generate time series CV indices with a gap between train and test.
"""
test_size = n_samples // (n_splits + 1)
for i in range(n_splits):
# Calculate train end index
train_end = test_size * (i + 1)
# Calculate test start and end (with gap)
test_start = train_end + gap
test_end = test_start + test_size
# Ensure we don't exceed array bounds
if test_end > n_samples:
test_end = n_samples
# Create index arrays
train_indices = np.arange(0, train_end)
test_indices = np.arange(test_start, test_end)
yield train_indices, test_indices
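A quick standalone check of the gap behavior (the generator is restated here, without type annotations, so the snippet runs on its own): with `gap=5`, the five indices immediately after each training window are skipped entirely.

```python
import numpy as np

def time_series_cv_with_gap(n_samples, n_splits=5, gap=0):
    # Restated from the solution above for a self-contained demo
    test_size = n_samples // (n_splits + 1)
    for i in range(n_splits):
        train_end = test_size * (i + 1)
        test_start = train_end + gap
        test_end = min(test_start + test_size, n_samples)
        yield np.arange(0, train_end), np.arange(test_start, test_end)

for tr, te in time_series_cv_with_gap(120, n_splits=3, gap=5):
    print(f"train 0..{tr[-1]} | skipped {tr[-1]+1}..{te[0]-1} | test {te[0]}..{te[-1]}")
    assert te[0] - tr[-1] - 1 == 5  # exactly `gap` indices excluded
```

The gap plays the same role as purging: it keeps labels computed near the end of the training window from overlapping the test window.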
2.4 Handling Imbalanced Data
Trading labels are often imbalanced (e.g., more up days than down days).
# Check class balance
print("Class Balance Analysis:")
print("="*50)
class_counts = y.value_counts()
class_pcts = y.value_counts(normalize=True) * 100
print(f"\nClass Distribution:")
print(f" Class 0 (Down): {class_counts[0]} ({class_pcts[0]:.1f}%)")
print(f" Class 1 (Up): {class_counts[1]} ({class_pcts[1]:.1f}%)")
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"\nImbalance Ratio: {imbalance_ratio:.2f}:1")
if imbalance_ratio > 1.5:
print("\nWarning: Dataset is imbalanced. Consider:")
print(" - Class weights")
print(" - Oversampling minority class")
print(" - Undersampling majority class")
print(" - Using appropriate metrics (F1, precision, recall)")
# Techniques for handling imbalanced data
from sklearn.utils.class_weight import compute_class_weight
# Method 1: Class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
weight_dict = dict(zip(np.unique(y), class_weights))
print("Method 1: Class Weights")
print(f" Computed weights: {weight_dict}")
print(" Usage: model.fit(X, y, sample_weight=weights)")
# Method 2: Sample weights based on class
sample_weights = np.array([weight_dict[label] for label in y])
print(f"\nMethod 2: Sample Weights")
print(f" Shape: {sample_weights.shape}")
# Method 3: Using class_weight parameter in sklearn models
print(f"\nMethod 3: Built-in class_weight parameter")
print(" Usage: RandomForestClassifier(class_weight='balanced')")
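The warning above also lists resampling. Here is a minimal random-oversampling sketch for the minority class on synthetic labels; in practice you would resample only within the training window, since duplicating rows across a time series split leaks test-period observations into training.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
y_demo = pd.Series(rng.choice([0, 1], size=200, p=[0.3, 0.7]))  # imbalanced labels

counts = y_demo.value_counts()
minority = counts.idxmin()
deficit = counts.max() - counts.min()

# Duplicate random minority-class rows until both classes are the same size
extra = y_demo[y_demo == minority].sample(deficit, replace=True, random_state=42)
y_balanced = pd.concat([y_demo, extra])

print("before:", counts.to_dict())
print("after: ", y_balanced.value_counts().to_dict())
```

Dedicated tools such as SMOTE exist for this, but plain duplication is often a reasonable first baseline for tabular trading features.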
# Compare balanced vs unbalanced training
from sklearn.metrics import classification_report
# Split data
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
# Model without class weights
model_unbalanced = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model_unbalanced.fit(X_train, y_train)
y_pred_unbalanced = model_unbalanced.predict(X_test)
# Model with class weights
model_balanced = RandomForestClassifier(
n_estimators=100,
max_depth=5,
random_state=42,
class_weight='balanced'
)
model_balanced.fit(X_train, y_train)
y_pred_balanced = model_balanced.predict(X_test)
print("Comparison: Unbalanced vs Balanced Training")
print("="*60)
print("\nUnbalanced Model:")
print(classification_report(y_test, y_pred_unbalanced, target_names=['Down', 'Up']))
print("\nBalanced Model (class_weight='balanced'):")
print(classification_report(y_test, y_pred_balanced, target_names=['Down', 'Up']))
Open-Ended Exercises
Exercise 2.4: Complete Data Pipeline (Open-ended)
Build a complete data preparation pipeline class.
Solution 2.4
class DataPipeline:
"""
Complete data preparation pipeline for financial ML.
"""
def __init__(self, symbol: str, period: str = '2y'):
self.symbol = symbol
self.period = period
self.raw_data = None
self.clean_data = None
self.train = None
self.test = None
self.quality_report = {}
def fetch(self) -> 'DataPipeline':
"""Download data."""
self.raw_data = yf.Ticker(self.symbol).history(period=self.period)
self.quality_report['rows_downloaded'] = len(self.raw_data)
return self
def clean(self, fill_method: str = 'ffill') -> 'DataPipeline':
"""Clean missing values."""
df = self.raw_data.copy()
# Record missing before
missing_before = df.isnull().sum().sum()
# Fill prices
price_cols = ['Open', 'High', 'Low', 'Close']
df[price_cols] = df[price_cols].ffill()
# Fill volume
df['Volume'] = df['Volume'].fillna(0)
# Remove remaining NaN rows
df = df.dropna()
self.clean_data = df
self.quality_report['missing_filled'] = missing_before
self.quality_report['rows_after_cleaning'] = len(df)
return self
def handle_outliers(self, column: str = 'returns', n_std: float = 5) -> 'DataPipeline':
"""Detect and optionally cap outliers."""
df = self.clean_data.copy()
# Create returns if not exists
if column not in df.columns:
df['returns'] = df['Close'].pct_change()
mean = df[column].mean()
std = df[column].std()
lower = mean - n_std * std
upper = mean + n_std * std
outliers = (df[column] < lower) | (df[column] > upper)
self.quality_report['outliers_detected'] = outliers.sum()
# Cap outliers
df[column] = df[column].clip(lower, upper)
self.clean_data = df
return self
def split(self, test_size: float = 0.2) -> 'DataPipeline':
"""Time series train/test split."""
n = len(self.clean_data)
split_idx = int(n * (1 - test_size))
self.train = self.clean_data.iloc[:split_idx]
self.test = self.clean_data.iloc[split_idx:]
self.quality_report['train_size'] = len(self.train)
self.quality_report['test_size'] = len(self.test)
self.quality_report['split_date'] = str(self.clean_data.index[split_idx].date())
return self
def get_report(self) -> dict:
"""Return quality report."""
return self.quality_report
# Test
pipeline = DataPipeline('AAPL', '2y')
pipeline.fetch().clean().handle_outliers().split()
print("Pipeline Report:")
for key, value in pipeline.get_report().items():
print(f" {key}: {value}")
Exercise 2.5: Walk-Forward Validator (Open-ended)
Build a walk-forward validation system.
Solution 2.5
class WalkForwardValidator:
"""
Walk-forward validation for time series ML.
"""
def __init__(self, initial_train_size: int, test_size: int, step_size: int = None):
"""
Args:
initial_train_size: Initial training window size
test_size: Size of each test window
step_size: How much to move forward (defaults to test_size)
"""
self.initial_train_size = initial_train_size
self.test_size = test_size
self.step_size = step_size or test_size
self.results = []
def split(self, X: pd.DataFrame):
"""
Generate walk-forward splits.
Yields:
Tuple of (train_idx, test_idx)
"""
n = len(X)
train_end = self.initial_train_size
while train_end + self.test_size <= n:
train_idx = np.arange(0, train_end)
test_idx = np.arange(train_end, train_end + self.test_size)
yield train_idx, test_idx
train_end += self.step_size
def validate(self, X, y, model, metric_func) -> dict:
"""
Run walk-forward validation.
Args:
X: Feature DataFrame
y: Target Series
model: sklearn-compatible model
metric_func: Function(y_true, y_pred) -> float
Returns:
Dictionary with validation results
"""
self.results = []
for fold, (train_idx, test_idx) in enumerate(self.split(X), 1):
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
# Train and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Calculate metric
score = metric_func(y_test, y_pred)
self.results.append({
'fold': fold,
'train_start': X.index[train_idx[0]],
'train_end': X.index[train_idx[-1]],
'test_start': X.index[test_idx[0]],
'test_end': X.index[test_idx[-1]],
'train_size': len(train_idx),
'test_size': len(test_idx),
'score': score
})
return {
'folds': self.results,
'mean_score': np.mean([r['score'] for r in self.results]),
'std_score': np.std([r['score'] for r in self.results])
}
# Test
wfv = WalkForwardValidator(initial_train_size=200, test_size=50, step_size=50)
model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
results = wfv.validate(X, y, model, accuracy_score)
print("Walk-Forward Validation Results:")
print(f"Mean Score: {results['mean_score']:.3f} (+/- {results['std_score']:.3f})")
print(f"\nFolds: {len(results['folds'])}")
for fold in results['folds'][:3]:
print(f" Fold {fold['fold']}: {fold['score']:.3f} ({fold['test_start'].date()} to {fold['test_end'].date()})")
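The number of folds follows directly from the window arithmetic. Here is a standalone sketch of the same expanding-window loop, with illustrative sizes (the numbers are assumptions for the sketch, not the output above):

```python
import numpy as np

# Standalone re-implementation of the walk-forward split loop
# (illustrative sizes; an expanding training window that grows by `step`)
n, init, test, step = 500, 200, 50, 50
folds = []
train_end = init
while train_end + test <= n:
    folds.append((np.arange(0, train_end), np.arange(train_end, train_end + test)))
    train_end += step

assert len(folds) == 6                 # train ends at 200, 250, ..., 450
assert len(folds[0][0]) == 200         # first training window
assert len(folds[-1][0]) == 450        # last training window covers most of the data
```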
Exercise 2.6: Imbalanced Data Handler (Open-ended)
Create a comprehensive solution for handling imbalanced trading labels.
Solution 2.6
class ImbalancedDataHandler:
"""
Handle imbalanced classification data in trading.
"""
def __init__(self, y: pd.Series):
self.y = y
self.class_counts = y.value_counts()
self.class_pcts = y.value_counts(normalize=True)
def analyze(self) -> dict:
"""Analyze class distribution."""
return {
'counts': self.class_counts.to_dict(),
'percentages': (self.class_pcts * 100).round(2).to_dict(),
'imbalance_ratio': self.class_counts.max() / self.class_counts.min(),
'majority_class': self.class_counts.idxmax(),
'minority_class': self.class_counts.idxmin()
}
def compute_class_weights(self, strategy: str = 'balanced') -> dict:
"""
Compute class weights.
Strategies:
- 'balanced': sklearn's balanced method
- 'inverse': simple inverse frequency
- 'sqrt_inverse': square root of inverse frequency
"""
classes = np.unique(self.y)
if strategy == 'balanced':
weights = compute_class_weight('balanced', classes=classes, y=self.y)
elif strategy == 'inverse':
weights = len(self.y) / (len(classes) * np.array([self.class_counts[c] for c in classes]))
elif strategy == 'sqrt_inverse':
weights = np.sqrt(len(self.y) / (len(classes) * np.array([self.class_counts[c] for c in classes])))
else:
weights = np.ones(len(classes))
return dict(zip(classes, weights))
def compute_sample_weights(self, class_weights: dict = None) -> np.ndarray:
"""Compute per-sample weights from class weights."""
if class_weights is None:
class_weights = self.compute_class_weights()
return np.array([class_weights[label] for label in self.y])
def get_report(self) -> str:
"""Generate analysis report."""
analysis = self.analyze()
weights = self.compute_class_weights()
report = "Imbalanced Data Analysis\n" + "="*40 + "\n"
report += f"\nClass Distribution:\n"
for cls, count in analysis['counts'].items():
pct = analysis['percentages'][cls]
report += f" Class {cls}: {count} ({pct}%)\n"
report += f"\nImbalance Ratio: {analysis['imbalance_ratio']:.2f}:1\n"
report += f"\nRecommended Class Weights:\n"
for cls, weight in weights.items():
report += f" Class {cls}: {weight:.3f}\n"
return report
# Test
handler = ImbalancedDataHandler(y)
print(handler.get_report())
# Get sample weights
sample_weights = handler.compute_sample_weights()
print(f"\nSample weights shape: {sample_weights.shape}")
print(f"Sample weights range: {sample_weights.min():.3f} - {sample_weights.max():.3f}")
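The 'balanced' strategy has a useful property: the resulting per-sample weights sum to the number of samples. A hand-computed check on a toy label array, using the same arithmetic as sklearn's `n_samples / (n_classes * count)` formula:

```python
import numpy as np

# Toy label array with a 3:1 imbalance
y = np.array([1, 1, 1, 0])
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)   # 'balanced' formula
w = dict(zip(classes, weights))
sample_w = np.array([w[label] for label in y])

assert w[0] == 2.0                           # minority class gets the larger weight
assert abs(sample_w.sum() - len(y)) < 1e-12  # balanced weights sum to n_samples
```

These per-sample weights can be passed to most sklearn estimators via `fit(X, y, sample_weight=...)`.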
Module Project: Data Preparation Pipeline
Build a complete, production-ready data preparation pipeline.
# Module Project: Complete Data Preparation Pipeline
import pandas as pd
import numpy as np
from typing import Dict, Tuple, Optional
from sklearn.model_selection import TimeSeriesSplit
from sklearn.utils.class_weight import compute_class_weight
import yfinance as yf
class MLDataPipeline:
"""
Production-ready data preparation pipeline for financial ML.
This pipeline handles the complete data preparation workflow:
1. Data fetching and validation
2. Cleaning and outlier handling
3. Time series splitting with purge/embargo
4. Class balance handling
5. Quality reporting
"""
def __init__(self, config: Dict = None):
"""
Initialize pipeline with configuration.
Args:
config: Pipeline configuration dictionary
"""
self.config = config or self._default_config()
self.raw_data = None
self.processed_data = None
self.train_data = None
self.test_data = None
self.quality_metrics = {}
self.processing_log = []
def _default_config(self) -> Dict:
"""Default pipeline configuration."""
return {
'data': {
'source': 'yfinance',
'period': '2y'
},
'cleaning': {
'fill_method': 'ffill',
'outlier_std': 5.0
},
'splitting': {
'test_size': 0.2,
'purge_days': 5,
'embargo_days': 1
},
'balance': {
'strategy': 'balanced'
}
}
def _log(self, message: str):
"""Add message to processing log."""
self.processing_log.append(message)
def fetch_data(self, symbol: str) -> 'MLDataPipeline':
"""
Fetch data for a symbol.
Args:
symbol: Ticker symbol
Returns:
Self for chaining
"""
period = self.config['data']['period']
self._log(f"Fetching {symbol} data for {period}...")
ticker = yf.Ticker(symbol)
self.raw_data = ticker.history(period=period)
self.quality_metrics['symbol'] = symbol
self.quality_metrics['rows_fetched'] = len(self.raw_data)
self.quality_metrics['date_range'] = (
str(self.raw_data.index[0].date()),
str(self.raw_data.index[-1].date())
)
self._log(f"Fetched {len(self.raw_data)} rows")
return self
def clean_data(self) -> 'MLDataPipeline':
"""
Clean the data by handling missing values and outliers.
Returns:
Self for chaining
"""
self._log("Cleaning data...")
df = self.raw_data.copy()
# Record initial missing
missing_before = df.isnull().sum().sum()
# Fill price data
price_cols = ['Open', 'High', 'Low', 'Close']
fill_method = self.config['cleaning']['fill_method']
# Apply the configured fill method ('ffill' or 'bfill') rather than hardcoding ffill
df[price_cols] = getattr(df[price_cols], fill_method)()
# Fill volume
df['Volume'] = df['Volume'].fillna(0)
# Remove remaining NaN
df = df.dropna()
# Create returns
df['returns'] = df['Close'].pct_change()
# Handle outliers
outlier_std = self.config['cleaning']['outlier_std']
mean = df['returns'].mean()
std = df['returns'].std()
lower, upper = mean - outlier_std * std, mean + outlier_std * std
outliers = (df['returns'] < lower) | (df['returns'] > upper)
df['returns'] = df['returns'].clip(lower, upper)
# Drop first row (NaN from pct_change)
df = df.dropna()
self.processed_data = df
self.quality_metrics['missing_filled'] = missing_before
self.quality_metrics['outliers_clipped'] = outliers.sum()
self.quality_metrics['rows_after_cleaning'] = len(df)
self._log(f"Filled {missing_before} missing values")
self._log(f"Clipped {outliers.sum()} outliers")
return self
def split_data(self) -> 'MLDataPipeline':
"""
Split data into train and test sets with purge/embargo.
Returns:
Self for chaining
"""
self._log("Splitting data...")
df = self.processed_data
n = len(df)
test_size = self.config['splitting']['test_size']
purge_days = self.config['splitting']['purge_days']
embargo_days = self.config['splitting']['embargo_days']
# Calculate split point
split_idx = int(n * (1 - test_size))
# Apply purge and embargo to training set
gap = purge_days + embargo_days
train_end = split_idx - gap if gap > 0 else split_idx
self.train_data = df.iloc[:train_end].copy()
self.test_data = df.iloc[split_idx:].copy()
self.quality_metrics['train_size'] = len(self.train_data)
self.quality_metrics['test_size'] = len(self.test_data)
self.quality_metrics['gap_size'] = gap
self.quality_metrics['split_date'] = str(df.index[split_idx].date())
self._log(f"Training set: {len(self.train_data)} rows")
self._log(f"Test set: {len(self.test_data)} rows")
self._log(f"Gap (purge + embargo): {gap} rows")
return self
def compute_class_weights(self, target_col: str = 'target') -> Dict:
"""
Compute class weights for imbalanced data.
Args:
target_col: Name of target column
Returns:
Dictionary of class weights
"""
if target_col not in self.train_data.columns:
return {}
y = self.train_data[target_col]
classes = np.unique(y)
weights = compute_class_weight('balanced', classes=classes, y=y)
return dict(zip(classes, weights))
def get_quality_report(self) -> str:
"""
Generate a comprehensive quality report.
Returns:
Formatted quality report string
"""
report = []
report.append("="*60)
report.append("DATA PIPELINE QUALITY REPORT")
report.append("="*60)
report.append(f"\nSymbol: {self.quality_metrics.get('symbol', 'N/A')}")
report.append(f"Date Range: {self.quality_metrics.get('date_range', 'N/A')}")
report.append("\nData Volume:")
report.append(f" Rows fetched: {self.quality_metrics.get('rows_fetched', 'N/A')}")
report.append(f" Rows after cleaning: {self.quality_metrics.get('rows_after_cleaning', 'N/A')}")
report.append("\nData Quality:")
report.append(f" Missing values filled: {self.quality_metrics.get('missing_filled', 'N/A')}")
report.append(f" Outliers clipped: {self.quality_metrics.get('outliers_clipped', 'N/A')}")
report.append("\nTrain/Test Split:")
report.append(f" Training samples: {self.quality_metrics.get('train_size', 'N/A')}")
report.append(f" Test samples: {self.quality_metrics.get('test_size', 'N/A')}")
report.append(f" Gap (purge + embargo): {self.quality_metrics.get('gap_size', 'N/A')}")
report.append(f" Split date: {self.quality_metrics.get('split_date', 'N/A')}")
report.append("\nProcessing Log:")
for log_entry in self.processing_log:
report.append(f" - {log_entry}")
return "\n".join(report)
def run(self, symbol: str) -> 'MLDataPipeline':
"""
Run the complete pipeline.
Args:
symbol: Ticker symbol to process
Returns:
Self with processed data
"""
return self.fetch_data(symbol).clean_data().split_data()
# Run the complete pipeline
print("Running Complete Data Pipeline...")
print("="*60)
# Initialize with custom config
config = {
'data': {'source': 'yfinance', 'period': '2y'},
'cleaning': {'fill_method': 'ffill', 'outlier_std': 5.0},
'splitting': {'test_size': 0.2, 'purge_days': 5, 'embargo_days': 2},
'balance': {'strategy': 'balanced'}
}
pipeline = MLDataPipeline(config)
pipeline.run('SPY')
# Print quality report
print(pipeline.get_quality_report())
# Show sample of processed data
print("\n" + "="*60)
print("SAMPLE PROCESSED DATA:")
print("="*60)
print(pipeline.processed_data.tail())
Key Takeaways
- Data Quality Matters: Clean missing values appropriately for financial data (forward fill for prices, zero for volume)
- Never Random Shuffle: Time series data must maintain temporal order in train/test splits
- Purge and Embargo: When labels span multiple days, add gaps between train and test to prevent leakage
- Time Series CV: Use TimeSeriesSplit or custom walk-forward validation, never standard k-fold
- Handle Imbalance: Use class weights or sample weights to address imbalanced trading labels
- Document Everything: Keep a processing log to track all data transformations
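The purge-and-embargo point can be sketched with plain index arithmetic (the sizes here are illustrative, not tied to any dataset):

```python
import numpy as np

# Chronological split with a purge + embargo gap between train and test
n, test_frac, purge, embargo = 500, 0.2, 5, 1
split = int(n * (1 - test_frac))        # test starts at row 400
gap = purge + embargo                   # rows dropped from the end of training
train_idx = np.arange(0, split - gap)   # training ends `gap` rows early
test_idx = np.arange(split, n)

assert train_idx[-1] + gap + 1 == test_idx[0]  # gap sits between the two sets
```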
Next: Module 3 - Feature Engineering
Learn how to create predictive features from price data, technical indicators, and statistical measures.
Module 3: Feature Engineering
Part 1: ML Fundamentals for Finance
| Duration | Exercises |
|---|---|
| ~2.5 hours | 6 |
Learning Objectives
By the end of this module, you will be able to:
- Create price-based features for ML models
- Convert technical indicators into ML features
- Build statistical features using rolling windows
- Apply feature selection techniques
3.1 Price-Based Features
Financial ML features begin with transformations of raw price data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf
import warnings
warnings.filterwarnings('ignore')
# Download sample data
print("Downloading data...")
df = yf.Ticker("SPY").history(period="2y")
print(f"Downloaded {len(df)} rows")
# Price-based features
def create_price_features(df: pd.DataFrame) -> pd.DataFrame:
"""
Create price-based features from OHLCV data.
Args:
df: DataFrame with OHLCV columns
Returns:
DataFrame with added features
"""
features = df.copy()
# Returns at multiple horizons
for period in [1, 5, 10, 20]:
features[f'return_{period}d'] = features['Close'].pct_change(period)
# Log returns (additive across time and often better behaved for ML)
features['log_return'] = np.log(features['Close'] / features['Close'].shift(1))
# Price ratios
features['close_to_open'] = features['Close'] / features['Open'] - 1
features['high_to_low'] = features['High'] / features['Low'] - 1
features['close_to_high'] = features['Close'] / features['High']
features['close_to_low'] = features['Close'] / features['Low']
# Gap features
features['overnight_gap'] = features['Open'] / features['Close'].shift(1) - 1
# Volume features
features['volume_change'] = features['Volume'].pct_change()
features['volume_ma_ratio'] = features['Volume'] / features['Volume'].rolling(20).mean()
return features
# Create features
df_features = create_price_features(df)
# Show new features
new_cols = [c for c in df_features.columns if c not in df.columns]
print(f"Created {len(new_cols)} price-based features:")
for col in new_cols:
print(f" - {col}")
# Volatility features
def create_volatility_features(df: pd.DataFrame, return_col: str = 'return_1d') -> pd.DataFrame:
"""
Create volatility-based features.
Args:
df: DataFrame with returns
return_col: Name of returns column
Returns:
DataFrame with volatility features
"""
features = df.copy()
# Historical volatility at different windows
for window in [5, 10, 20, 60]:
features[f'volatility_{window}d'] = features[return_col].rolling(window).std() * np.sqrt(252)
# Volatility ratio (short-term vs long-term)
features['vol_ratio_5_20'] = features['volatility_5d'] / features['volatility_20d']
# Parkinson volatility (uses high/low)
features['parkinson_vol'] = np.sqrt(
(np.log(features['High'] / features['Low']) ** 2).rolling(20).mean() / (4 * np.log(2))
) * np.sqrt(252)
# Garman-Klass volatility
features['gk_vol'] = np.sqrt(
(
0.5 * (np.log(features['High'] / features['Low']) ** 2) -
(2 * np.log(2) - 1) * (np.log(features['Close'] / features['Open']) ** 2)
).rolling(20).mean()
) * np.sqrt(252)
return features
# Add volatility features
df_features = create_volatility_features(df_features)
vol_cols = [c for c in df_features.columns if 'vol' in c.lower()]
print(f"\nVolatility features: {vol_cols}")
Exercise 3.1: Price Feature Generator (Guided)
Build a comprehensive price feature generator.
Solution 3.1
def generate_return_features(df: pd.DataFrame, horizons: list = None) -> pd.DataFrame:
"""
Generate return features at multiple horizons.
"""
if horizons is None:
horizons = [1, 2, 5, 10, 20]
features = df.copy()
for h in horizons:
# Calculate simple returns
features[f'return_{h}d'] = features['Close'].pct_change(h)
# Calculate log returns
features[f'log_return_{h}d'] = np.log(features['Close'] / features['Close'].shift(h))
return features
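A quick, self-contained sanity check of these formulas on a tiny synthetic price series (not market data), applying the same pct_change and log-return calculations as the solution above:

```python
import pandas as pd
import numpy as np

# Synthetic closes; same formulas as generate_return_features
close = pd.Series([100.0, 102.0, 101.0, 103.0])
ret_1d = close.pct_change(1)
log_1d = np.log(close / close.shift(1))

assert abs(ret_1d.iloc[1] - 0.02) < 1e-12          # 102/100 - 1
# For small moves, log return is close to the simple return
assert abs(log_1d.iloc[1] - np.log(1.02)) < 1e-12
```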
3.2 Technical Indicator Features
Converting traditional technical indicators into ML-ready features.
# Technical indicator features
def create_technical_features(df: pd.DataFrame) -> pd.DataFrame:
"""
Create technical indicator features.
Args:
df: DataFrame with OHLCV data
Returns:
DataFrame with technical features
"""
features = df.copy()
# Moving Averages
for period in [5, 10, 20, 50, 200]:
features[f'sma_{period}'] = features['Close'].rolling(period).mean()
features[f'ema_{period}'] = features['Close'].ewm(span=period, adjust=False).mean()
# Distance from MA (normalized)
features[f'dist_sma_{period}'] = (features['Close'] - features[f'sma_{period}']) / features[f'sma_{period}']
# RSI
delta = features['Close'].diff()
gain = delta.where(delta > 0, 0).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
rs = gain / loss
features['rsi_14'] = 100 - (100 / (1 + rs))
# RSI normalized to [-1, 1] for better ML performance
features['rsi_normalized'] = (features['rsi_14'] - 50) / 50
# MACD
ema12 = features['Close'].ewm(span=12, adjust=False).mean()
ema26 = features['Close'].ewm(span=26, adjust=False).mean()
features['macd'] = ema12 - ema26
features['macd_signal'] = features['macd'].ewm(span=9, adjust=False).mean()
features['macd_hist'] = features['macd'] - features['macd_signal']
# Normalize MACD by price
features['macd_normalized'] = features['macd'] / features['Close']
# Bollinger Bands
sma20 = features['Close'].rolling(20).mean()
std20 = features['Close'].rolling(20).std()
features['bb_upper'] = sma20 + 2 * std20
features['bb_lower'] = sma20 - 2 * std20
features['bb_position'] = (features['Close'] - features['bb_lower']) / (features['bb_upper'] - features['bb_lower'])
features['bb_width'] = (features['bb_upper'] - features['bb_lower']) / sma20
# ATR
high_low = features['High'] - features['Low']
high_close = abs(features['High'] - features['Close'].shift())
low_close = abs(features['Low'] - features['Close'].shift())
true_range = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
features['atr_14'] = true_range.rolling(14).mean()
features['atr_normalized'] = features['atr_14'] / features['Close']
return features
# Create technical features
df_features = create_technical_features(df)
# Show technical features
tech_cols = ['rsi_14', 'rsi_normalized', 'macd_normalized', 'bb_position', 'bb_width', 'atr_normalized']
print("Sample technical features:")
print(df_features[tech_cols].tail())
# Crossover and divergence features
def create_crossover_features(df: pd.DataFrame) -> pd.DataFrame:
"""
Create features based on indicator crossovers and divergences.
Args:
df: DataFrame with technical indicators
Returns:
DataFrame with crossover features
"""
features = df.copy()
# Create MAs if not present
if 'sma_20' not in features.columns:
features['sma_20'] = features['Close'].rolling(20).mean()
if 'sma_50' not in features.columns:
features['sma_50'] = features['Close'].rolling(50).mean()
# Price above/below MA (binary)
features['above_sma_20'] = (features['Close'] > features['sma_20']).astype(int)
features['above_sma_50'] = (features['Close'] > features['sma_50']).astype(int)
# Days since last crossover
cross_20 = features['above_sma_20'].diff().abs()
features['days_since_sma20_cross'] = cross_20.groupby((cross_20 == 1).cumsum()).cumcount()
# MA crossovers
features['golden_cross'] = (
(features['sma_20'] > features['sma_50']) &
(features['sma_20'].shift(1) <= features['sma_50'].shift(1))
).astype(int)
features['death_cross'] = (
(features['sma_20'] < features['sma_50']) &
(features['sma_20'].shift(1) >= features['sma_50'].shift(1))
).astype(int)
# RSI oversold/overbought
if 'rsi_14' in features.columns:
features['rsi_oversold'] = (features['rsi_14'] < 30).astype(int)
features['rsi_overbought'] = (features['rsi_14'] > 70).astype(int)
return features
# Add crossover features
df_features = create_crossover_features(df_features)
cross_cols = ['above_sma_20', 'above_sma_50', 'golden_cross', 'death_cross']
print("Crossover features:")
print(df_features[cross_cols].tail(10))
Exercise 3.2: Technical Feature Builder (Guided)
Create an RSI feature with multiple periods and normalizations.
Solution 3.2
def build_rsi_features(df: pd.DataFrame, periods: list = None) -> pd.DataFrame:
"""
Build RSI features at multiple periods.
"""
if periods is None:
periods = [7, 14, 21]
features = df.copy()
for period in periods:
# Calculate price changes
delta = features['Close'].diff()
# Separate gains and losses
gain = delta.where(delta > 0, 0).rolling(period).mean()
loss = (-delta.where(delta < 0, 0)).rolling(period).mean()
# Calculate RSI
rs = gain / loss
features[f'rsi_{period}'] = 100 - (100 / (1 + rs))
# Normalized version
features[f'rsi_{period}_norm'] = (features[f'rsi_{period}'] - 50) / 50
return features
3.3 Statistical Features
Statistical transformations that help ML models understand data distribution.
# Statistical features
def create_statistical_features(df: pd.DataFrame, windows: list = None) -> pd.DataFrame:
"""
Create statistical features using rolling windows.
Args:
df: DataFrame with price data
windows: List of window sizes
Returns:
DataFrame with statistical features
"""
if windows is None:
windows = [5, 10, 20]
features = df.copy()
# Ensure returns exist
if 'returns' not in features.columns:
features['returns'] = features['Close'].pct_change()
for window in windows:
# Rolling statistics
features[f'rolling_mean_{window}'] = features['returns'].rolling(window).mean()
features[f'rolling_std_{window}'] = features['returns'].rolling(window).std()
features[f'rolling_min_{window}'] = features['returns'].rolling(window).min()
features[f'rolling_max_{window}'] = features['returns'].rolling(window).max()
# Z-score of returns
mean = features['returns'].rolling(window).mean()
std = features['returns'].rolling(window).std()
features[f'zscore_{window}'] = (features['returns'] - mean) / std
# Percentile rank
features[f'percentile_rank_{window}'] = features['returns'].rolling(window).apply(
lambda x: (x[-1] > x[:-1]).mean() if len(x) > 1 else 0.5,
raw=True
)
# Skewness and kurtosis
features[f'skew_{window}'] = features['returns'].rolling(window).skew()
features[f'kurtosis_{window}'] = features['returns'].rolling(window).kurt()
return features
# Create statistical features
df_features = create_statistical_features(df)
stat_cols = ['zscore_20', 'percentile_rank_20', 'skew_20', 'kurtosis_20']
print("Statistical features:")
print(df_features[stat_cols].dropna().tail())
# Autocorrelation features
def create_autocorrelation_features(df: pd.DataFrame, lags: list = None, window: int = 60) -> pd.DataFrame:
"""
Create autocorrelation features.
These capture momentum/mean-reversion patterns.
Args:
df: DataFrame with returns
lags: List of lags to compute
window: Rolling window size
Returns:
DataFrame with autocorrelation features
"""
if lags is None:
lags = [1, 2, 5, 10]
features = df.copy()
if 'returns' not in features.columns:
features['returns'] = features['Close'].pct_change()
for lag in lags:
# Rolling autocorrelation
features[f'autocorr_lag{lag}'] = features['returns'].rolling(window).apply(
lambda x: x.autocorr(lag=lag) if len(x) > lag else np.nan,
raw=False
)
return features
# Create autocorrelation features
df_autocorr = create_autocorrelation_features(df_features[['Close', 'returns']].copy())
autocorr_cols = [c for c in df_autocorr.columns if 'autocorr' in c]
print("Autocorrelation features:")
print(df_autocorr[autocorr_cols].dropna().tail())
Exercise 3.3: Z-Score Feature Builder (Guided)
Build z-score features for multiple columns.
Solution 3.3
def create_zscore_features(df: pd.DataFrame, columns: list, window: int = 20) -> pd.DataFrame:
"""
Create z-score normalized versions of columns.
"""
features = df.copy()
for col in columns:
if col not in features.columns:
continue
# Calculate rolling mean
roll_mean = features[col].rolling(window).mean()
# Calculate rolling standard deviation
roll_std = features[col].rolling(window).std()
# Calculate z-score
features[f'{col}_zscore'] = (features[col] - roll_mean) / roll_std
# Clip extreme values for stability
features[f'{col}_zscore'] = features[f'{col}_zscore'].clip(-3, 3)
return features
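A self-contained check of the clipping behaviour on synthetic data: an extreme spike should land exactly on the +3 bound used in the solution above.

```python
import pandas as pd
import numpy as np

# Synthetic series with one extreme outlier at the end
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(0, 1, 100))
s.iloc[-1] = 50.0

# Same rolling z-score + clip logic as create_zscore_features
z = (s - s.rolling(20).mean()) / s.rolling(20).std()
z = z.clip(-3, 3)

assert z.iloc[-1] == 3.0        # spike capped at the upper bound
assert z.abs().max() <= 3.0
```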
3.4 Feature Selection
Selecting the most predictive features and removing redundant ones.
# Feature correlation analysis
def analyze_feature_correlations(df: pd.DataFrame, feature_cols: list, threshold: float = 0.8) -> dict:
"""
Analyze correlations between features.
Args:
df: DataFrame with features
feature_cols: List of feature columns
threshold: Correlation threshold for "high" correlation
Returns:
Dictionary with correlation analysis
"""
# Calculate correlation matrix
corr_matrix = df[feature_cols].corr().abs()
# Find highly correlated pairs
high_corr_pairs = []
for i in range(len(feature_cols)):
for j in range(i + 1, len(feature_cols)):
if corr_matrix.iloc[i, j] >= threshold:
high_corr_pairs.append({
'feature_1': feature_cols[i],
'feature_2': feature_cols[j],
'correlation': corr_matrix.iloc[i, j]
})
return {
'correlation_matrix': corr_matrix,
'high_correlation_pairs': high_corr_pairs,
'n_high_corr': len(high_corr_pairs)
}
# Create a feature set and analyze
df_full = create_price_features(df)
df_full = create_volatility_features(df_full)  # needed so volatility_5d/20d exist below
df_full = create_technical_features(df_full)
df_full = df_full.dropna()
# Select numeric feature columns
feature_cols = ['return_1d', 'return_5d', 'return_20d', 'volatility_5d', 'volatility_20d',
'rsi_normalized', 'macd_normalized', 'bb_position', 'atr_normalized']
feature_cols = [c for c in feature_cols if c in df_full.columns]
corr_analysis = analyze_feature_correlations(df_full, feature_cols)
print(f"High correlation pairs (>0.8): {corr_analysis['n_high_corr']}")
for pair in corr_analysis['high_correlation_pairs'][:5]:
print(f" {pair['feature_1']} <-> {pair['feature_2']}: {pair['correlation']:.3f}")
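A common follow-up is to greedily drop one feature from each highly correlated pair. A minimal, self-contained sketch on synthetic features (the names f1/f2/f3 are illustrative, not columns from the analysis above):

```python
import pandas as pd
import numpy as np

# f2 is f1 plus tiny noise (near-duplicate); f3 is independent
rng = np.random.default_rng(1)
a = rng.normal(size=200)
df_demo = pd.DataFrame({'f1': a,
                        'f2': a + rng.normal(scale=0.01, size=200),
                        'f3': rng.normal(size=200)})

# Look only at the strict upper triangle so each pair is counted once
corr = df_demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] >= 0.8).any()]
kept = [c for c in df_demo.columns if c not in to_drop]

assert to_drop == ['f2'] and 'f3' in kept
```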
# Feature importance with Random Forest
from sklearn.ensemble import RandomForestClassifier
def get_feature_importance(df: pd.DataFrame, feature_cols: list, target_col: str) -> pd.DataFrame:
"""
Calculate feature importance using Random Forest.
Args:
df: DataFrame with features and target
feature_cols: List of feature columns
target_col: Name of target column
Returns:
DataFrame with feature importances
"""
# Prepare data
df_clean = df[feature_cols + [target_col]].dropna()
X = df_clean[feature_cols]
y = df_clean[target_col]
# Train model
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X, y)
# Get importances
importance_df = pd.DataFrame({
'feature': feature_cols,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
return importance_df
# Create target
df_full['target'] = (df_full['Close'].pct_change().shift(-1) > 0).astype(int)
# Get feature importances
importance_df = get_feature_importance(df_full, feature_cols, 'target')
print("Feature Importances (Random Forest):")
print(importance_df)
# Recursive feature elimination
from sklearn.feature_selection import RFE
def select_features_rfe(df: pd.DataFrame, feature_cols: list, target_col: str, n_features: int = 5) -> list:
"""
Select top features using Recursive Feature Elimination.
Args:
df: DataFrame with features and target
feature_cols: List of feature columns
target_col: Name of target column
n_features: Number of features to select
Returns:
List of selected feature names
"""
# Prepare data
df_clean = df[feature_cols + [target_col]].dropna()
X = df_clean[feature_cols]
y = df_clean[target_col]
# RFE with Random Forest
model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
rfe = RFE(estimator=model, n_features_to_select=n_features, step=1)
rfe.fit(X, y)
# Get selected features
selected = [f for f, s in zip(feature_cols, rfe.support_) if s]
return selected
# Select top 5 features
selected_features = select_features_rfe(df_full, feature_cols, 'target', n_features=5)
print(f"\nTop 5 features (RFE):")
for i, feat in enumerate(selected_features, 1):
print(f" {i}. {feat}")
Open-Ended Exercises
Exercise 3.4: Complete Feature Library (Open-ended)
Build a comprehensive feature engineering library.
Solution 3.4
class FeatureLibrary:
"""
Comprehensive feature engineering library for financial ML.
"""
def __init__(self, df: pd.DataFrame):
self.original_df = df.copy()
self.features_df = df.copy()
self.feature_names = []
self.feature_groups = {}
def add_price_features(self, horizons: list = None) -> 'FeatureLibrary':
"""Add price-based features."""
if horizons is None:
horizons = [1, 5, 10, 20]
new_features = []
for h in horizons:
col = f'return_{h}d'
self.features_df[col] = self.features_df['Close'].pct_change(h)
new_features.append(col)
# Volatility
for w in [5, 20]:
col = f'volatility_{w}d'
self.features_df[col] = self.features_df['return_1d'].rolling(w).std() * np.sqrt(252)
new_features.append(col)
self.feature_names.extend(new_features)
self.feature_groups['price'] = new_features
return self
def add_technical_features(self) -> 'FeatureLibrary':
"""Add technical indicator features."""
new_features = []
# RSI
delta = self.features_df['Close'].diff()
gain = delta.where(delta > 0, 0).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
self.features_df['rsi_14'] = 100 - (100 / (1 + gain / loss))
self.features_df['rsi_normalized'] = (self.features_df['rsi_14'] - 50) / 50
new_features.extend(['rsi_14', 'rsi_normalized'])
# MACD
ema12 = self.features_df['Close'].ewm(span=12).mean()
ema26 = self.features_df['Close'].ewm(span=26).mean()
self.features_df['macd_normalized'] = (ema12 - ema26) / self.features_df['Close']
new_features.append('macd_normalized')
# Bollinger position
sma20 = self.features_df['Close'].rolling(20).mean()
std20 = self.features_df['Close'].rolling(20).std()
self.features_df['bb_position'] = (
(self.features_df['Close'] - (sma20 - 2*std20)) / (4*std20)
)
new_features.append('bb_position')
self.feature_names.extend(new_features)
self.feature_groups['technical'] = new_features
return self
def add_statistical_features(self, window: int = 20) -> 'FeatureLibrary':
"""Add statistical features."""
new_features = []
if 'return_1d' not in self.features_df.columns:
self.features_df['return_1d'] = self.features_df['Close'].pct_change()
# Z-score
mean = self.features_df['return_1d'].rolling(window).mean()
std = self.features_df['return_1d'].rolling(window).std()
self.features_df['return_zscore'] = (self.features_df['return_1d'] - mean) / std
self.features_df['return_zscore'] = self.features_df['return_zscore'].clip(-3, 3)
new_features.append('return_zscore')
# Skewness and kurtosis
self.features_df['skew_20'] = self.features_df['return_1d'].rolling(window).skew()
self.features_df['kurtosis_20'] = self.features_df['return_1d'].rolling(window).kurt()
new_features.extend(['skew_20', 'kurtosis_20'])
self.feature_names.extend(new_features)
self.feature_groups['statistical'] = new_features
return self
def select_features(self, target: pd.Series, n_features: int = 10) -> list:
"""Select top features using importance."""
df_clean = self.features_df[self.feature_names].dropna()
target_clean = target.loc[df_clean.index]
model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
model.fit(df_clean, target_clean)
importance = pd.Series(model.feature_importances_, index=self.feature_names)
return importance.nlargest(n_features).index.tolist()
def get_feature_matrix(self, feature_list: list = None) -> pd.DataFrame:
"""Return clean feature matrix."""
if feature_list is None:
feature_list = self.feature_names
return self.features_df[feature_list].dropna()
def get_summary(self) -> str:
"""Return feature library summary."""
summary = f"Feature Library Summary\n" + "="*40 + "\n"
summary += f"Total features: {len(self.feature_names)}\n"
for group, features in self.feature_groups.items():
summary += f" {group}: {len(features)} features\n"
return summary
# Test
library = FeatureLibrary(df)
library.add_price_features().add_technical_features().add_statistical_features()
print(library.get_summary())
print(f"\nFeature matrix shape: {library.get_feature_matrix().shape}")
Exercise 3.5: Multi-Timeframe Features (Open-ended)
Create features that combine information from multiple timeframes.
Solution 3.5
def create_multi_timeframe_features(df: pd.DataFrame) -> pd.DataFrame:
"""
Create features combining multiple timeframes.
Args:
df: DataFrame with daily OHLCV data
Returns:
DataFrame with multi-timeframe features
"""
features = df.copy()
# Daily features (1-5 days)
features['momentum_daily'] = features['Close'].pct_change(5)
features['trend_daily'] = (features['Close'] > features['Close'].rolling(5).mean()).astype(int)
# Weekly features (5-20 days)
features['momentum_weekly'] = features['Close'].pct_change(20)
features['trend_weekly'] = (features['Close'] > features['Close'].rolling(20).mean()).astype(int)
# Monthly features (20-60 days)
features['momentum_monthly'] = features['Close'].pct_change(60)
features['trend_monthly'] = (features['Close'] > features['Close'].rolling(60).mean()).astype(int)
# Timeframe ratios
features['momentum_ratio_dw'] = features['momentum_daily'] / features['momentum_weekly'].abs().clip(lower=0.001)
features['momentum_ratio_wm'] = features['momentum_weekly'] / features['momentum_monthly'].abs().clip(lower=0.001)
# Trend alignment
features['trend_alignment'] = (
features['trend_daily'] + features['trend_weekly'] + features['trend_monthly']
) / 3 # 0 = all bearish, 1 = all bullish
# Divergence signals
features['daily_weekly_divergence'] = (
(features['trend_daily'] == 1) & (features['trend_weekly'] == 0)
).astype(int)
# Volatility across timeframes
features['vol_5d'] = features['Close'].pct_change().rolling(5).std() * np.sqrt(252)
features['vol_20d'] = features['Close'].pct_change().rolling(20).std() * np.sqrt(252)
features['vol_ratio'] = features['vol_5d'] / features['vol_20d']
return features
# Test
df_mtf = create_multi_timeframe_features(df)
mtf_cols = ['momentum_daily', 'momentum_weekly', 'momentum_monthly',
'trend_alignment', 'vol_ratio']
print("Multi-timeframe features:")
print(df_mtf[mtf_cols].dropna().tail())
Exercise 3.6: Feature Pipeline Builder (Open-ended)
Create a complete feature engineering pipeline that's ready for production.
Solution 3.6
import pickle
from sklearn.preprocessing import StandardScaler
class FeaturePipeline:
"""
Production-ready feature engineering pipeline.
"""
def __init__(self, config: dict = None):
self.config = config or self._default_config()
self.scaler = StandardScaler()
self.selected_features = None
self.feature_stats = {}
self.is_fitted = False
def _default_config(self) -> dict:
return {
'return_horizons': [1, 5, 10, 20],
'volatility_windows': [5, 20],
'rsi_period': 14,
'macd_params': (12, 26, 9),
'zscore_window': 20,
'n_features': 10
}
def _create_all_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create all features from raw data."""
features = df.copy()
# Returns
for h in self.config['return_horizons']:
features[f'return_{h}d'] = features['Close'].pct_change(h)
# Volatility
for w in self.config['volatility_windows']:
features[f'vol_{w}d'] = features['return_1d'].rolling(w).std() * np.sqrt(252)
# RSI
period = self.config['rsi_period']
delta = features['Close'].diff()
gain = delta.where(delta > 0, 0).rolling(period).mean()
loss = (-delta.where(delta < 0, 0)).rolling(period).mean()
features['rsi'] = 100 - (100 / (1 + gain / loss))
features['rsi_norm'] = (features['rsi'] - 50) / 50
# MACD
fast, slow, signal = self.config['macd_params']
ema_fast = features['Close'].ewm(span=fast).mean()
ema_slow = features['Close'].ewm(span=slow).mean()
features['macd_norm'] = (ema_fast - ema_slow) / features['Close']
# Z-score
window = self.config['zscore_window']
mean = features['return_1d'].rolling(window).mean()
std = features['return_1d'].rolling(window).std()
features['return_zscore'] = ((features['return_1d'] - mean) / std).clip(-3, 3)
return features
def fit(self, df: pd.DataFrame, target: pd.Series) -> 'FeaturePipeline':
"""Fit the pipeline on training data."""
# Create features
features_df = self._create_all_features(df)
# Get feature columns
feature_cols = [c for c in features_df.columns
if c not in ['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits']]
# Clean data
clean_idx = features_df[feature_cols].dropna().index
X_clean = features_df.loc[clean_idx, feature_cols]
y_clean = target.loc[clean_idx]
# Feature selection
model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
model.fit(X_clean, y_clean)
importance = pd.Series(model.feature_importances_, index=feature_cols)
self.selected_features = importance.nlargest(self.config['n_features']).index.tolist()
# Fit scaler
self.scaler.fit(X_clean[self.selected_features])
# Store statistics
self.feature_stats = {
'means': X_clean[self.selected_features].mean().to_dict(),
'stds': X_clean[self.selected_features].std().to_dict()
}
self.is_fitted = True
return self
def transform(self, df: pd.DataFrame) -> pd.DataFrame:
"""Transform new data using fitted pipeline."""
if not self.is_fitted:
raise ValueError("Pipeline not fitted. Call fit() first.")
features_df = self._create_all_features(df)
X = features_df[self.selected_features]
X_scaled = pd.DataFrame(
self.scaler.transform(X),
index=X.index,
columns=self.selected_features
)
return X_scaled
def fit_transform(self, df: pd.DataFrame, target: pd.Series) -> pd.DataFrame:
"""Fit and transform in one step."""
self.fit(df, target)
return self.transform(df)
def save(self, filepath: str):
"""Save pipeline to file."""
with open(filepath, 'wb') as f:
pickle.dump(self, f)
@classmethod
def load(cls, filepath: str) -> 'FeaturePipeline':
"""Load pipeline from file."""
with open(filepath, 'rb') as f:
return pickle.load(f)
# Test
target = (df['Close'].pct_change().shift(-1) > 0).astype(int)
pipeline = FeaturePipeline()
X = pipeline.fit_transform(df, target)
print(f"Selected features: {pipeline.selected_features}")
print(f"\nTransformed shape: {X.shape}")
print(X.dropna().tail())
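The `save()`/`load()` methods above persist the whole fitted pipeline with pickle. The round trip can be sketched with a standalone fitted scaler and an in-memory buffer (illustrative only; in production you would write to a file path exactly as the class methods do):

```python
import io
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler, serialize it, then restore it from the buffer
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))
buf = io.BytesIO()
pickle.dump(scaler, buf)
buf.seek(0)
restored = pickle.load(buf)

print(restored.mean_[0])  # the fitted statistics survive the round trip
```

Anything reachable from the object (selected feature names, scaler statistics, config) is serialized together, which is why a single pickle file is enough to reproduce `transform()` on new data.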
Module Project: Feature Engineering Library
Build a comprehensive, reusable feature engineering library.
# Module Project: Feature Engineering Library
import pandas as pd
import numpy as np
from typing import Dict, List, Optional
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
class FinancialFeatureEngine:
"""
Comprehensive feature engineering library for financial ML.
This library provides a complete suite of features commonly used
in quantitative trading and financial machine learning.
"""
def __init__(self):
self.features_df = None
self.feature_catalog = {}
self.scaler = StandardScaler()
def fit(self, df: pd.DataFrame) -> 'FinancialFeatureEngine':
"""
Initialize the feature engine with data.
Args:
df: DataFrame with OHLCV data
Returns:
Self for method chaining
"""
self.features_df = df.copy()
return self
def add_returns(self, periods: List[int] = None) -> 'FinancialFeatureEngine':
"""
Add return features at multiple horizons.
Args:
periods: List of periods for returns
"""
if periods is None:
periods = [1, 2, 5, 10, 20]
features = []
for p in periods:
col = f'return_{p}d'
self.features_df[col] = self.features_df['Close'].pct_change(p)
features.append(col)
# Log returns
col_log = f'log_return_{p}d'
self.features_df[col_log] = np.log(
self.features_df['Close'] / self.features_df['Close'].shift(p)
)
features.append(col_log)
self.feature_catalog['returns'] = features
return self
def add_volatility(self, windows: List[int] = None) -> 'FinancialFeatureEngine':
"""
Add volatility features.
Args:
windows: List of rolling window sizes
"""
if windows is None:
windows = [5, 10, 20, 60]
# Ensure daily returns exist
if 'return_1d' not in self.features_df.columns:
self.features_df['return_1d'] = self.features_df['Close'].pct_change()
features = []
for w in windows:
col = f'volatility_{w}d'
self.features_df[col] = (
self.features_df['return_1d'].rolling(w).std() * np.sqrt(252)
)
features.append(col)
# Volatility ratios
if 'volatility_5d' in self.features_df.columns and 'volatility_20d' in self.features_df.columns:
self.features_df['vol_ratio_5_20'] = (
self.features_df['volatility_5d'] / self.features_df['volatility_20d']
)
features.append('vol_ratio_5_20')
self.feature_catalog['volatility'] = features
return self
def add_momentum(self) -> 'FinancialFeatureEngine':
"""
Add momentum indicators.
"""
features = []
# RSI
delta = self.features_df['Close'].diff()
gain = delta.where(delta > 0, 0).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
self.features_df['rsi_14'] = 100 - (100 / (1 + gain / loss))
self.features_df['rsi_normalized'] = (self.features_df['rsi_14'] - 50) / 50
features.extend(['rsi_14', 'rsi_normalized'])
# MACD
ema12 = self.features_df['Close'].ewm(span=12).mean()
ema26 = self.features_df['Close'].ewm(span=26).mean()
self.features_df['macd'] = ema12 - ema26
self.features_df['macd_signal'] = self.features_df['macd'].ewm(span=9).mean()
self.features_df['macd_normalized'] = self.features_df['macd'] / self.features_df['Close']
features.append('macd_normalized')
# Stochastic
low_14 = self.features_df['Low'].rolling(14).min()
high_14 = self.features_df['High'].rolling(14).max()
self.features_df['stoch_k'] = (
(self.features_df['Close'] - low_14) / (high_14 - low_14) * 100
)
self.features_df['stoch_k_normalized'] = (self.features_df['stoch_k'] - 50) / 50
features.append('stoch_k_normalized')
self.feature_catalog['momentum'] = features
return self
def add_trend(self) -> 'FinancialFeatureEngine':
"""
Add trend-following features.
"""
features = []
# Distance from moving averages
for period in [20, 50, 200]:
sma = self.features_df['Close'].rolling(period).mean()
col = f'dist_sma_{period}'
self.features_df[col] = (self.features_df['Close'] - sma) / sma
features.append(col)
# Trend direction
self.features_df['above_sma_20'] = (
self.features_df['Close'] > self.features_df['Close'].rolling(20).mean()
).astype(int)
self.features_df['above_sma_50'] = (
self.features_df['Close'] > self.features_df['Close'].rolling(50).mean()
).astype(int)
features.extend(['above_sma_20', 'above_sma_50'])
self.feature_catalog['trend'] = features
return self
def add_volume(self) -> 'FinancialFeatureEngine':
"""
Add volume-based features.
"""
features = []
# Volume change
self.features_df['volume_change'] = self.features_df['Volume'].pct_change()
features.append('volume_change')
# Volume relative to average
self.features_df['volume_ma_ratio'] = (
self.features_df['Volume'] / self.features_df['Volume'].rolling(20).mean()
)
features.append('volume_ma_ratio')
# Volume z-score
vol_mean = self.features_df['Volume'].rolling(20).mean()
vol_std = self.features_df['Volume'].rolling(20).std()
self.features_df['volume_zscore'] = (
(self.features_df['Volume'] - vol_mean) / vol_std
).clip(-3, 3)
features.append('volume_zscore')
self.feature_catalog['volume'] = features
return self
def add_statistical(self, window: int = 20) -> 'FinancialFeatureEngine':
"""
Add statistical features.
Args:
window: Rolling window size
"""
features = []
if 'return_1d' not in self.features_df.columns:
self.features_df['return_1d'] = self.features_df['Close'].pct_change()
# Z-score of returns
mean = self.features_df['return_1d'].rolling(window).mean()
std = self.features_df['return_1d'].rolling(window).std()
self.features_df['return_zscore'] = (
(self.features_df['return_1d'] - mean) / std
).clip(-3, 3)
features.append('return_zscore')
# Skewness and kurtosis
self.features_df['skew'] = self.features_df['return_1d'].rolling(window).skew()
self.features_df['kurtosis'] = self.features_df['return_1d'].rolling(window).kurt()
features.extend(['skew', 'kurtosis'])
self.feature_catalog['statistical'] = features
return self
def build_all(self) -> 'FinancialFeatureEngine':
"""
Build all available features.
"""
return (
self.add_returns()
.add_volatility()
.add_momentum()
.add_trend()
.add_volume()
.add_statistical()
)
def get_feature_names(self, groups: List[str] = None) -> List[str]:
"""
Get list of feature names.
Args:
groups: Optional list of feature groups to include
Returns:
List of feature names
"""
if groups is None:
groups = list(self.feature_catalog.keys())
features = []
for group in groups:
if group in self.feature_catalog:
features.extend(self.feature_catalog[group])
return features
def get_features(self, groups: List[str] = None, dropna: bool = True) -> pd.DataFrame:
"""
Get feature DataFrame.
Args:
groups: Optional list of feature groups to include
dropna: Whether to drop rows with missing values
Returns:
DataFrame with selected features
"""
feature_names = self.get_feature_names(groups)
df = self.features_df[feature_names]
if dropna:
df = df.dropna()
return df
def get_summary(self) -> str:
"""
Get a summary of all features.
Returns:
Formatted summary string
"""
summary = ["Financial Feature Engine Summary", "="*50]
total = 0
for group, features in self.feature_catalog.items():
summary.append(f"\n{group.upper()} ({len(features)} features):")
for f in features:
summary.append(f" - {f}")
total += len(features)
summary.append(f"\nTOTAL: {total} features")
return "\n".join(summary)
# Demo the feature engine
print("Building Financial Feature Engine...")
print("="*60)
# Initialize and build features
engine = FinancialFeatureEngine()
engine.fit(df).build_all()
# Print summary
print(engine.get_summary())
# Get feature matrix
features = engine.get_features()
print(f"\nFeature matrix shape: {features.shape}")
print("\nSample features:")
print(features.tail())
Key Takeaways
- Normalize Features: Convert raw indicators to normalized forms (z-scores, percentages) for better ML performance
- Multiple Horizons: Create features at different time scales (1, 5, 20, 60 days) to capture patterns at various frequencies
- Feature Types: Combine price-based, technical, and statistical features for comprehensive coverage
- Handle Correlations: Remove highly correlated features to reduce redundancy and overfitting
- Feature Selection: Use importance scores or RFE to identify the most predictive features
- Clip Outliers: Z-scores and other features should be clipped (e.g., ±3) for stability
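The "Handle Correlations" takeaway can be sketched with the standard upper-triangle filter on the absolute correlation matrix (synthetic features and a 0.95 cutoff, both illustrative choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
feats = pd.DataFrame({'mom_5d': rng.normal(size=200)})
feats['mom_5d_dup'] = feats['mom_5d'] + rng.normal(scale=0.01, size=200)  # near-duplicate
feats['vol_20d'] = rng.normal(size=200)                                   # independent

# Keep only the upper triangle so each feature pair is examined once
corr = feats.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
print(to_drop)  # only the near-duplicate column is flagged
```

Dropping one column of each highly correlated pair keeps the information while removing the redundancy that inflates feature-importance noise.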
Next: Module 4 - Target Engineering
Learn how to define prediction targets using the triple barrier method, meta-labeling, and proper handling of overlapping labels.
Module 4: Target Engineering
Part 1: ML Fundamentals for Finance
| Duration | Exercises |
|---|---|
| ~2.5 hours | 6 |
Learning Objectives
By the end of this module, you will be able to:
- Define effective prediction targets for trading
- Implement the triple barrier method for labeling
- Apply meta-labeling for signal filtering
- Avoid lookahead bias in target creation
4.1 Defining Targets
The target (what we predict) is just as important as the features.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf
import warnings
warnings.filterwarnings('ignore')
# Download sample data
print("Downloading data...")
df = yf.Ticker("SPY").history(period="2y")
print(f"Downloaded {len(df)} rows")
# Different types of targets
print("Types of Prediction Targets")
print("="*50)
target_types = {
'Direction': {
'description': 'Up or down (binary classification)',
'example': '1 if tomorrow\'s return > 0, else 0',
'pros': 'Simple, clear signal',
'cons': 'Ignores magnitude, many small movements'
},
'Return Magnitude': {
'description': 'Actual return value (regression)',
'example': 'Tomorrow\'s return = 0.5%',
'pros': 'More information, enables position sizing',
'cons': 'Harder to predict, noisy'
},
'Multi-class Direction': {
'description': 'Strong up, weak up, neutral, weak down, strong down',
'example': '0: < -2%, 1: -2% to -0.5%, 2: -0.5% to 0.5%, 3: 0.5% to 2%, 4: > 2%',
'pros': 'More nuanced than binary',
'cons': 'Class imbalance, harder to train'
},
'Triple Barrier': {
'description': 'First barrier hit: profit, loss, or time',
'example': 'Label based on which exit occurs first',
'pros': 'Most realistic for trading',
'cons': 'More complex to implement'
}
}
for target, details in target_types.items():
print(f"\n{target}:")
print(f" Description: {details['description']}")
print(f" Example: {details['example']}")
print(f" Pros: {details['pros']}")
print(f" Cons: {details['cons']}")
# Simple direction target
def create_direction_target(df: pd.DataFrame, horizon: int = 1) -> pd.Series:
"""
Create a simple binary direction target.
Args:
df: DataFrame with 'Close' column
horizon: Number of days to look ahead
Returns:
Series with binary labels (1 = up, 0 = down)
"""
future_return = df['Close'].pct_change(horizon).shift(-horizon)
# Mask the last `horizon` rows: their future return is unknown, so they
# must not be silently labeled 0 (down)
target = (future_return > 0).astype(float)
target[future_return.isna()] = np.nan
return target
# Create targets at different horizons
df['target_1d'] = create_direction_target(df, horizon=1)
df['target_5d'] = create_direction_target(df, horizon=5)
df['target_20d'] = create_direction_target(df, horizon=20)
# Check class balance
print("Class Balance by Horizon:")
for col in ['target_1d', 'target_5d', 'target_20d']:
pct_up = df[col].mean() * 100
print(f" {col}: {pct_up:.1f}% up, {100-pct_up:.1f}% down")
# Multi-class target based on return magnitude
def create_multiclass_target(df: pd.DataFrame, horizon: int = 1, thresholds: list = None) -> pd.Series:
"""
Create multi-class target based on return magnitude.
Args:
df: DataFrame with 'Close' column
horizon: Number of days to look ahead
thresholds: Return thresholds for classes
Returns:
Series with multi-class labels
"""
if thresholds is None:
thresholds = [-0.02, -0.005, 0.005, 0.02] # -2%, -0.5%, 0.5%, 2%
future_return = df['Close'].pct_change(horizon).shift(-horizon)
# Create labels: 0 (strong down), 1 (weak down), 2 (neutral), 3 (weak up), 4 (strong up)
conditions = [
future_return <= thresholds[0],
(future_return > thresholds[0]) & (future_return <= thresholds[1]),
(future_return > thresholds[1]) & (future_return <= thresholds[2]),
(future_return > thresholds[2]) & (future_return <= thresholds[3]),
future_return > thresholds[3]
]
labels = [0, 1, 2, 3, 4]
target = pd.Series(np.select(conditions, labels, default=2), index=df.index, dtype=float)
target[future_return.isna()] = np.nan  # last `horizon` rows have no future return
return target
# Create multi-class target
df['target_multiclass'] = create_multiclass_target(df, horizon=5)
print("\nMulti-class Target Distribution:")
label_names = ['Strong Down', 'Weak Down', 'Neutral', 'Weak Up', 'Strong Up']
for label, name in enumerate(label_names):
count = (df['target_multiclass'] == label).sum()
pct = count / df['target_multiclass'].notna().sum() * 100
print(f" {label} ({name}): {count} ({pct:.1f}%)")
Exercise 4.1: Target Creator (Guided)
Build a flexible target creation function.
Solution 4.1
def create_target(df: pd.DataFrame, target_type: str = 'direction',
horizon: int = 1, threshold: float = 0.0) -> pd.Series:
"""
Create different types of prediction targets.
"""
# Calculate future return
future_return = df['Close'].pct_change(horizon).shift(-horizon)
if target_type == 'direction':
# Binary direction (1 if up, 0 if down); rows with no future return stay NaN
target = (future_return > 0).astype(float).where(future_return.notna())
elif target_type == 'return':
# Return the actual return value
target = future_return
elif target_type == 'threshold':
# Only label significant moves
target = pd.Series(0, index=df.index)
target[future_return > threshold] = 1
target[future_return < -threshold] = -1
else:
raise ValueError(f"Unknown target_type: {target_type}")
return target
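A quick sanity check of the direction logic on a short synthetic price series (hypothetical values):

```python
import pandas as pd

prices = pd.DataFrame({'Close': [100.0, 102.0, 101.0, 104.0, 103.0]})

# Tomorrow's return, aligned to today's row
future_return = prices['Close'].pct_change(1).shift(-1)
direction = (future_return > 0).astype(int)
print(direction.tolist())  # note: the last row has no future return and falls to 0
```

Each label answers "was the next close higher?" for its own row; the final row illustrates why rows without a future return should be dropped before training.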
4.2 The Triple Barrier Method
A more realistic labeling approach that mirrors actual trading exits.
# Triple Barrier Method explained
print("Triple Barrier Method")
print("="*50)
print("""
The triple barrier method labels based on which exit occurs first:
┌─────────────────────────────────────┐
│ Take Profit Barrier (upper) │ → Label = +1
│ ================================ │
│ │
Entry│ Price Path │
Point│ /\ /\ │
│ / \ / \ │
│ / \/ \ │
│ ================================ │
│ Stop Loss Barrier (lower) │ → Label = -1
└─────────────────────────────────────┘
│
Time Barrier → Label based on return
Three possible outcomes:
1. Price hits UPPER barrier first → Label = +1 (profitable)
2. Price hits LOWER barrier first → Label = -1 (loss)
3. TIME expires first → Label based on final return (+1, 0, or -1)
Benefits:
- Realistic: Mirrors stop-loss and take-profit orders
- Balanced: Can adjust barriers for class balance
- Actionable: Labels correspond to trading decisions
""")
# Implement triple barrier method
def triple_barrier_labels(
df: pd.DataFrame,
take_profit: float = 0.02,
stop_loss: float = 0.02,
max_holding: int = 10
) -> pd.DataFrame:
"""
Apply triple barrier method for labeling.
Args:
df: DataFrame with 'Close' column
take_profit: Upper barrier (e.g., 0.02 = 2%)
stop_loss: Lower barrier (e.g., 0.02 = 2%)
max_holding: Maximum holding period in days
Returns:
DataFrame with labels and exit info
"""
labels = []
for i in range(len(df) - max_holding):
entry_price = df['Close'].iloc[i]
entry_date = df.index[i]
# Calculate barriers
upper_barrier = entry_price * (1 + take_profit)
lower_barrier = entry_price * (1 - stop_loss)
# Look for barrier touches
for j in range(1, max_holding + 1):
if i + j >= len(df):
break
high = df['High'].iloc[i + j]
low = df['Low'].iloc[i + j]
close = df['Close'].iloc[i + j]
# Check upper barrier (take profit)
if high >= upper_barrier:
labels.append({
'entry_date': entry_date,
'exit_date': df.index[i + j],
'holding_period': j,
'exit_type': 'take_profit',
'label': 1
})
break
# Check lower barrier (stop loss)
if low <= lower_barrier:
labels.append({
'entry_date': entry_date,
'exit_date': df.index[i + j],
'holding_period': j,
'exit_type': 'stop_loss',
'label': -1
})
break
# Check time barrier
if j == max_holding:
final_return = (close - entry_price) / entry_price
label = 1 if final_return > 0 else (-1 if final_return < 0 else 0)
labels.append({
'entry_date': entry_date,
'exit_date': df.index[i + j],
'holding_period': j,
'exit_type': 'time_barrier',
'label': label
})
return pd.DataFrame(labels)
# Apply triple barrier
labels_df = triple_barrier_labels(df, take_profit=0.02, stop_loss=0.02, max_holding=10)
print("Triple Barrier Results:")
print("="*50)
print(f"Total labeled samples: {len(labels_df)}")
print(f"\nExit type distribution:")
print(labels_df['exit_type'].value_counts())
print(f"\nLabel distribution:")
print(labels_df['label'].value_counts())
print(f"\nAverage holding period: {labels_df['holding_period'].mean():.1f} days")
# Visualize triple barrier on a sample
def visualize_triple_barrier(df: pd.DataFrame, start_idx: int,
take_profit: float = 0.02, stop_loss: float = 0.02,
max_holding: int = 10):
"""
Visualize triple barrier for a single trade.
"""
entry_price = df['Close'].iloc[start_idx]
upper = entry_price * (1 + take_profit)
lower = entry_price * (1 - stop_loss)
# Get price path
end_idx = min(start_idx + max_holding, len(df) - 1)
prices = df['Close'].iloc[start_idx:end_idx + 1]
fig, ax = plt.subplots(figsize=(12, 5))
# Plot price
ax.plot(range(len(prices)), prices, 'b-', linewidth=2, label='Price')
# Plot barriers
ax.axhline(upper, color='green', linestyle='--', label=f'Take Profit ({take_profit:.1%})')
ax.axhline(lower, color='red', linestyle='--', label=f'Stop Loss ({stop_loss:.1%})')
ax.axhline(entry_price, color='gray', linestyle=':', alpha=0.5, label='Entry')
# Mark entry
ax.scatter([0], [entry_price], color='blue', s=100, zorder=5, marker='o')
ax.set_xlabel('Days')
ax.set_ylabel('Price')
ax.set_title('Triple Barrier Example')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Visualize an example
visualize_triple_barrier(df, start_idx=100, take_profit=0.02, stop_loss=0.02)
Exercise 4.2: Triple Barrier Labeler (Guided)
Create a simplified triple barrier labeling function.
Solution 4.2
def simple_triple_barrier(df: pd.DataFrame, profit_target: float = 0.02,
stop_loss: float = 0.02, max_days: int = 5) -> pd.Series:
"""
Simplified triple barrier that returns just the labels.
"""
labels = pd.Series(index=df.index, dtype=float)
for i in range(len(df) - max_days):
entry = df['Close'].iloc[i]
# Calculate barrier levels
upper = entry * (1 + profit_target)
lower = entry * (1 - stop_loss)
label = 0 # Default: time barrier
for j in range(1, max_days + 1):
# Check if upper barrier hit
if df['High'].iloc[i + j] >= upper:
label = 1
break
# Check if lower barrier hit
if df['Low'].iloc[i + j] <= lower:
label = -1
break
labels.iloc[i] = label
return labels
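The barrier-touch logic can be checked by hand on a tiny bar series (hypothetical prices, 2% barriers):

```python
import pandas as pd

bars = pd.DataFrame({
    'Close': [100.0, 101.0, 103.0, 102.0],
    'High':  [100.5, 101.5, 103.5, 102.5],
    'Low':   [99.5, 100.5, 102.0, 101.0],
})
entry = bars['Close'].iloc[0]
upper = entry * 1.02   # take-profit barrier at 102
lower = entry * 0.98   # stop-loss barrier at 98

label = 0  # default: time barrier expires
for j in range(1, len(bars)):
    if bars['High'].iloc[j] >= upper:   # intraday high crosses take profit
        label = 1
        break
    if bars['Low'].iloc[j] <= lower:    # intraday low crosses stop loss
        label = -1
        break
print(label)  # day 2's high (103.5) touches the 102 barrier first
```

Using `High`/`Low` rather than `Close` matters: a barrier can be touched intraday even when the close ends back inside the band.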
4.3 Meta-Labeling
Meta-labeling uses ML to filter another model's signals.
# Meta-labeling concept
print("Meta-Labeling")
print("="*50)
print("""
Meta-labeling is a two-model approach:
┌──────────────────────────────────────────────────────────────┐
│ STEP 1: Primary Model (e.g., trend-following strategy) │
│ Generates buy/sell signals │
│ Example: Buy when price crosses above SMA │
└───────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ STEP 2: Meta-Model (ML classifier) │
│ Filters primary signals: "Should I take this trade?"│
│ Target: Was the primary signal profitable? │
└───────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ FINAL: Only take trades where both models agree │
│ Primary says "buy" AND Meta says "likely profitable" │
└──────────────────────────────────────────────────────────────┘
Benefits:
1. Separates signal generation from signal filtering
2. Maintains interpretable primary model
3. Meta-model can consider additional features
4. Often improves precision at cost of recall
""")
# Implement meta-labeling
def create_meta_labels(
df: pd.DataFrame,
primary_signal: pd.Series,
holding_period: int = 5
) -> pd.Series:
"""
Create meta-labels for primary model signals.
Args:
df: DataFrame with price data
primary_signal: Series with primary model signals (1 for long, -1 for short)
holding_period: Days to hold position
Returns:
Series with meta-labels (1 if signal was profitable, 0 if not)
"""
meta_labels = pd.Series(index=df.index, dtype=float)
# Calculate forward returns
forward_return = df['Close'].pct_change(holding_period).shift(-holding_period)
# Only label when primary signal exists
signal_idx = primary_signal[primary_signal != 0].index
for idx in signal_idx:
if idx in forward_return.index and not pd.isna(forward_return.loc[idx]):
signal = primary_signal.loc[idx]
ret = forward_return.loc[idx]
# Meta-label: Was the signal profitable?
# Long signal (1) is profitable if return > 0
# Short signal (-1) is profitable if return < 0
profitable = (signal * ret) > 0
meta_labels.loc[idx] = 1 if profitable else 0
return meta_labels
# Create a simple primary model (MA crossover)
df['sma_20'] = df['Close'].rolling(20).mean()
df['sma_50'] = df['Close'].rolling(50).mean()
# Primary signal: 1 when short MA above long MA
df['primary_signal'] = np.where(df['sma_20'] > df['sma_50'], 1, -1)
# Generate meta-labels
meta_labels = create_meta_labels(df, df['primary_signal'], holding_period=10)
print("Meta-Label Distribution:")
print(meta_labels.dropna().value_counts())
print(f"\nWin rate of primary signals: {meta_labels.dropna().mean():.1%}")
# Complete meta-labeling example
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
def meta_labeling_example(df: pd.DataFrame):
"""
Complete meta-labeling workflow example.
"""
# Step 1: Create features
features = pd.DataFrame(index=df.index)
features['returns_5d'] = df['Close'].pct_change(5)
features['volatility'] = df['Close'].pct_change().rolling(20).std()
delta = df['Close'].diff()
gain = delta.where(delta > 0, 0).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
features['rsi'] = 100 - (100 / (1 + gain / loss))
features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
# Step 2: Create primary signal
sma_20 = df['Close'].rolling(20).mean()
sma_50 = df['Close'].rolling(50).mean()
primary_signal = pd.Series(
np.where(sma_20 > sma_50, 1, -1),
index=df.index
)
# Step 3: Create meta-labels
meta_labels = create_meta_labels(df, primary_signal, holding_period=10)
# Step 4: Prepare data for meta-model
feature_cols = ['returns_5d', 'volatility', 'rsi', 'volume_ratio']
df_ml = pd.concat([features[feature_cols], meta_labels.rename('meta_label')], axis=1)
df_ml = df_ml.dropna()
# Only use rows where we have a signal
df_ml = df_ml[df_ml['meta_label'].notna()]
X = df_ml[feature_cols]
y = df_ml['meta_label'].astype(int)
# Step 5: Train meta-model (time series split)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
meta_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
meta_model.fit(X_train, y_train)
# Step 6: Evaluate
y_pred = meta_model.predict(X_test)
return {
'primary_win_rate': y.mean(),
'meta_accuracy': accuracy_score(y_test, y_pred),
'test_predictions': pd.Series(y_pred, index=y_test.index),
'feature_importance': dict(zip(feature_cols, meta_model.feature_importances_))
}
# Run meta-labeling example
results = meta_labeling_example(df)
print("Meta-Labeling Results:")
print("="*50)
print(f"Primary Model Win Rate: {results['primary_win_rate']:.1%}")
print(f"Meta-Model Accuracy: {results['meta_accuracy']:.1%}")
print(f"\nFeature Importance:")
for feat, imp in sorted(results['feature_importance'].items(), key=lambda x: -x[1]):
print(f" {feat}: {imp:.3f}")
Exercise 4.3: Meta-Label Generator (Guided)
Create a function that generates meta-labels for any primary signal.
Solution 4.3
def generate_meta_labels(df: pd.DataFrame, signal_col: str,
profit_threshold: float = 0.01,
holding_days: int = 5) -> pd.Series:
"""
Generate meta-labels for a given signal column.
"""
# Calculate forward return
forward_return = df['Close'].pct_change(holding_days).shift(-holding_days)
# Get signal values
signal = df[signal_col]
# Calculate actual profit (signal * return)
actual_profit = signal * forward_return
# Create meta-label (1 if profit > threshold, else 0)
meta_label = (actual_profit > profit_threshold).astype(int)
# Only keep labels where signal was non-zero
meta_label = meta_label.where(signal != 0)
return meta_label
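A quick check of the same labeling logic on a toy series (profit threshold set to 0 here for illustration; the solution's default is 0.01):

```python
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0])
signal = pd.Series([1, -1, 0, 1, 0])  # hypothetical primary signals

forward_return = close.pct_change(1).shift(-1)
actual_profit = signal * forward_return                     # signed return per signal
meta_label = (actual_profit > 0.0).astype(float).where(signal != 0)
print(meta_label.tolist())
```

The short signal on day 1 earns a positive meta-label because the price falls afterward, while days with no signal stay NaN instead of being forced into a class.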
4.4 Avoiding Lookahead Bias
The most common and deadly mistake in financial ML.
# Lookahead bias examples
print("Common Lookahead Bias Mistakes")
print("="*50)
mistakes = [
{
'name': 'Using Same-Day Close in Features',
'wrong': 'feature = (close - sma) / sma # close includes today',
'right': 'feature = (close.shift(1) - sma.shift(1)) / sma.shift(1)',
'explanation': 'Features should only use information available at decision time'
},
{
'name': 'Scaling with Full Dataset Statistics',
'wrong': 'scaler.fit(X) # Uses future data statistics',
'right': 'scaler.fit(X_train) # Only use training data',
'explanation': 'Statistics (mean, std) must come only from past data'
},
{
'name': 'Feature Selection Using All Data',
'wrong': 'Select features using correlation with target on all data',
'right': 'Select features only on training data',
'explanation': 'Feature selection is part of model fitting'
},
{
'name': 'Using Adjusted Prices',
'wrong': 'Split-adjusted prices for historical signals',
'right': 'Use unadjusted prices, adjust at point in time',
'explanation': 'Price adjustments are applied retroactively'
}
]
for i, mistake in enumerate(mistakes, 1):
print(f"\n{i}. {mistake['name']}")
print(f" WRONG: {mistake['wrong']}")
print(f" RIGHT: {mistake['right']}")
print(f" Why: {mistake['explanation']}")
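The scaling mistake is the easiest one to demonstrate concretely: fit the scaler on the training slice only, then reuse those statistics for the test slice. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.arange(100, dtype=float).reshape(-1, 1)  # synthetic feature, trending upward
split = 80                                      # chronological train/test split

scaler = StandardScaler().fit(X[:split])        # statistics from the training slice only
X_test_scaled = scaler.transform(X[split:])     # test rows scaled with *past* statistics

print(scaler.mean_[0])  # 39.5: mean of the first 80 rows, not of all 100
```

Had the scaler been fit on all 100 rows, the mean would be 49.5 and every training-time z-score would quietly embed information about the future.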
# Lookahead bias checker
def check_lookahead_bias(df: pd.DataFrame, feature_cols: list, target_col: str) -> dict:
"""
Check for potential lookahead bias in features.
Args:
df: DataFrame with features and target
feature_cols: List of feature columns
target_col: Name of target column
Returns:
Dictionary with warnings and analysis
"""
warnings = []
analysis = {}
# Check correlation between features and target
# Extremely high correlation might indicate lookahead
for col in feature_cols:
if col in df.columns and target_col in df.columns:
corr = df[col].corr(df[target_col])
if abs(corr) > 0.5:
warnings.append(
f"High correlation ({corr:.2f}) between '{col}' and target - "
f"possible lookahead bias"
)
analysis[col] = {'correlation': corr}
# Check for suspicious column names
suspicious_keywords = ['future', 'forward', 'next', 'tomorrow']
for col in feature_cols:
for keyword in suspicious_keywords:
if keyword in col.lower():
warnings.append(
f"Column '{col}' contains suspicious keyword '{keyword}'"
)
return {
'warnings': warnings,
'analysis': analysis,
'has_potential_issues': len(warnings) > 0
}
# Create some features with potential issues
test_df = df.copy()
test_df['future_return'] = test_df['Close'].pct_change().shift(-1) # LOOKAHEAD!
test_df['past_return'] = test_df['Close'].pct_change() # OK
test_df['target'] = (test_df['Close'].pct_change().shift(-1) > 0).astype(int)
# Check for bias
result = check_lookahead_bias(
test_df.dropna(),
['future_return', 'past_return'],
'target'
)
print("Lookahead Bias Check:")
print("="*50)
print(f"Potential issues found: {result['has_potential_issues']}")
if result['warnings']:
print("\nWarnings:")
for warning in result['warnings']:
print(f" - {warning}")
Open-Ended Exercises
Exercise 4.4: Adaptive Barrier Labels (Open-ended)
Create triple barrier labels with adaptive barriers based on volatility.
Solution 4.4
def adaptive_triple_barrier(df: pd.DataFrame, vol_multiplier: float = 2.0,
vol_window: int = 20, max_days: int = 10) -> pd.DataFrame:
"""
Triple barrier with volatility-adaptive barriers.
Args:
df: DataFrame with OHLCV data
vol_multiplier: Multiplier for volatility to set barriers
vol_window: Window for volatility calculation
max_days: Maximum holding period
Returns:
DataFrame with labels and barrier info
"""
# Calculate daily volatility
returns = df['Close'].pct_change()
volatility = returns.rolling(vol_window).std()
results = []
for i in range(vol_window, len(df) - max_days):
entry_price = df['Close'].iloc[i]
entry_date = df.index[i]
# Adaptive barrier width based on current volatility
current_vol = volatility.iloc[i]
barrier_width = current_vol * vol_multiplier
upper = entry_price * (1 + barrier_width)
lower = entry_price * (1 - barrier_width)
label = 0
exit_type = 'time'
exit_day = max_days
for j in range(1, max_days + 1):
if df['High'].iloc[i + j] >= upper:
label = 1
exit_type = 'profit'
exit_day = j
break
if df['Low'].iloc[i + j] <= lower:
label = -1
exit_type = 'stop'
exit_day = j
break
results.append({
'date': entry_date,
'entry_price': entry_price,
'volatility': current_vol,
'barrier_width': barrier_width,
'upper_barrier': upper,
'lower_barrier': lower,
'label': label,
'exit_type': exit_type,
'exit_day': exit_day
})
return pd.DataFrame(results)
# Test
adaptive_labels = adaptive_triple_barrier(df, vol_multiplier=2.0, vol_window=20, max_days=10)
print("Adaptive Triple Barrier Results:")
print(f"Label distribution:\n{adaptive_labels['label'].value_counts()}")
print(f"\nAverage barrier width: {adaptive_labels['barrier_width'].mean():.2%}")
print(f"Barrier width range: {adaptive_labels['barrier_width'].min():.2%} to {adaptive_labels['barrier_width'].max():.2%}")
Exercise 4.5: Label Quality Analyzer (Open-ended)
Build a comprehensive label quality analysis tool.
Solution 4.5
class LabelAnalyzer:
"""
Analyze label quality for financial ML.
"""
def __init__(self, labels: pd.Series, holding_period: int = 1):
self.labels = labels.dropna()
self.holding_period = holding_period
def class_distribution(self) -> dict:
"""Analyze class distribution."""
counts = self.labels.value_counts()
pcts = self.labels.value_counts(normalize=True) * 100
return {
'counts': counts.to_dict(),
'percentages': pcts.round(2).to_dict(),
'imbalance_ratio': counts.max() / counts.min()
}
def label_overlap(self) -> dict:
"""Check for overlapping labels."""
# Labels overlap if they're within holding_period of each other
label_dates = self.labels.index
overlap_count = 0
for i, date in enumerate(label_dates[:-1]):
next_date = label_dates[i + 1]
if (next_date - date).days < self.holding_period:
overlap_count += 1
return {
'total_labels': len(self.labels),
'overlapping': overlap_count,
'overlap_percentage': overlap_count / len(self.labels) * 100
}
def label_uniqueness(self) -> pd.Series:
"""
Calculate uniqueness score for each label.
Labels are less unique if they overlap with many others.
"""
uniqueness = pd.Series(1.0, index=self.labels.index)
for i, date in enumerate(self.labels.index):
# Count overlapping labels
start = date - pd.Timedelta(days=self.holding_period)
end = date + pd.Timedelta(days=self.holding_period)
overlapping = self.labels[(self.labels.index >= start) &
(self.labels.index <= end) &
(self.labels.index != date)]
if len(overlapping) > 0:
uniqueness.loc[date] = 1 / (1 + len(overlapping))
return uniqueness
def sample_weights(self) -> pd.Series:
"""Generate sample weights based on uniqueness."""
uniqueness = self.label_uniqueness()
# Normalize to sum to number of samples
weights = uniqueness / uniqueness.sum() * len(uniqueness)
return weights
def get_report(self) -> str:
"""Generate full analysis report."""
dist = self.class_distribution()
overlap = self.label_overlap()
uniqueness = self.label_uniqueness()
report = "Label Quality Report\n" + "="*50 + "\n"
report += "\nClass Distribution:\n"
for cls, count in dist['counts'].items():
pct = dist['percentages'][cls]
report += f" Class {cls}: {count} ({pct}%)\n"
report += f" Imbalance Ratio: {dist['imbalance_ratio']:.2f}\n"
report += f"\nLabel Overlap:\n"
report += f" Overlapping labels: {overlap['overlapping']} ({overlap['overlap_percentage']:.1f}%)\n"
report += f"\nLabel Uniqueness:\n"
report += f" Mean uniqueness: {uniqueness.mean():.3f}\n"
report += f" Min uniqueness: {uniqueness.min():.3f}\n"
return report
# Test
test_labels = df['target_5d'].dropna()
analyzer = LabelAnalyzer(test_labels, holding_period=5)
print(analyzer.get_report())
Exercise 4.6: Complete Labeling System (Open-ended)
Build a production-ready labeling system.
Solution 4.6
class LabelingSystem:
"""
Production-ready labeling system for financial ML.
"""
METHODS = ['direction', 'triple_barrier', 'threshold', 'meta']
def __init__(self, df: pd.DataFrame):
self.df = df.copy()
self.labels = None
self.sample_weights = None
self.config = {}
def create_labels(self, method: str = 'direction', **kwargs) -> 'LabelingSystem':
"""
Create labels using specified method.
Args:
method: Labeling method
**kwargs: Method-specific parameters
"""
self.config = {'method': method, **kwargs}
if method == 'direction':
horizon = kwargs.get('horizon', 1)
future_return = self.df['Close'].pct_change(horizon).shift(-horizon)
self.labels = (future_return > 0).astype(int)
elif method == 'triple_barrier':
tp = kwargs.get('take_profit', 0.02)
sl = kwargs.get('stop_loss', 0.02)
max_hold = kwargs.get('max_holding', 10)
self.labels = pd.Series(index=self.df.index, dtype=float)
for i in range(len(self.df) - max_hold):
entry = self.df['Close'].iloc[i]
upper = entry * (1 + tp)
lower = entry * (1 - sl)
label = 0
for j in range(1, max_hold + 1):
if self.df['High'].iloc[i + j] >= upper:
label = 1
break
if self.df['Low'].iloc[i + j] <= lower:
label = -1
break
self.labels.iloc[i] = label
elif method == 'threshold':
horizon = kwargs.get('horizon', 5)
threshold = kwargs.get('threshold', 0.02)
future_return = self.df['Close'].pct_change(horizon).shift(-horizon)
self.labels = pd.Series(0, index=self.df.index)
self.labels[future_return > threshold] = 1
self.labels[future_return < -threshold] = -1
else:
raise ValueError(f"Unknown method: {method}. Use one of {self.METHODS}")
return self
def check_bias(self, feature_df: pd.DataFrame) -> dict:
"""Check for potential lookahead bias."""
warnings = []
for col in feature_df.columns:
# Check suspicious names
for keyword in ['future', 'forward', 'next']:
if keyword in col.lower():
warnings.append(f"Suspicious column name: {col}")
# Check high correlation
if self.labels is not None:
corr = feature_df[col].corr(self.labels)
if abs(corr) > 0.5:
warnings.append(f"High correlation ({corr:.2f}): {col}")
return {
'warnings': warnings,
'has_issues': len(warnings) > 0
}
def compute_sample_weights(self, holding_period: int = None) -> pd.Series:
"""Compute sample weights based on label uniqueness."""
if self.labels is None:
raise ValueError("Create labels first")
if holding_period is None:
holding_period = self.config.get('horizon', 1)
weights = pd.Series(1.0, index=self.labels.dropna().index)
for i, date in enumerate(weights.index):
start = date - pd.Timedelta(days=holding_period)
end = date + pd.Timedelta(days=holding_period)
concurrent = weights[(weights.index >= start) &
(weights.index <= end)]
weights.loc[date] = 1 / len(concurrent)
self.sample_weights = weights / weights.sum() * len(weights)
return self.sample_weights
def get_labels(self, dropna: bool = True) -> pd.Series:
"""Get the labels."""
if self.labels is None:
raise ValueError("Create labels first")
return self.labels.dropna() if dropna else self.labels
def get_summary(self) -> str:
"""Get labeling summary."""
if self.labels is None:
return "No labels created yet"
labels = self.labels.dropna()
summary = "Labeling System Summary\n" + "="*50 + "\n"
summary += f"Method: {self.config.get('method', 'unknown')}\n"
summary += f"Config: {self.config}\n"
summary += f"\nTotal labels: {len(labels)}\n"
summary += f"Label distribution:\n{labels.value_counts()}\n"
return summary
# Test
labeler = LabelingSystem(df)
labeler.create_labels('triple_barrier', take_profit=0.02, stop_loss=0.02, max_holding=5)
print(labeler.get_summary())
# Compute weights
weights = labeler.compute_sample_weights(holding_period=5)
print(f"\nSample weights computed: {len(weights)} samples")
Module Project: Target Labeling System
Build a comprehensive target engineering system.
# Module Project: Target Labeling System
import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Union
class TargetEngineer:
"""
Complete target engineering system for financial ML.
Supports multiple labeling methods and provides quality analysis.
"""
def __init__(self, df: pd.DataFrame):
"""
Initialize with price data.
Args:
df: DataFrame with OHLCV columns
"""
self.df = df.copy()
self.targets = {}
self.quality_metrics = {}
def create_direction_target(self, name: str, horizon: int = 1) -> pd.Series:
"""
Create binary direction target.
Args:
name: Name for this target
horizon: Days ahead to predict
Returns:
Series with binary labels
"""
future_return = self.df['Close'].pct_change(horizon).shift(-horizon)
target = (future_return > 0).astype(int)
self.targets[name] = target
return target
def create_triple_barrier_target(
self,
name: str,
profit_target: float = 0.02,
stop_loss: float = 0.02,
max_holding: int = 10
) -> pd.Series:
"""
Create triple barrier target.
Args:
name: Name for this target
profit_target: Take profit level
stop_loss: Stop loss level
max_holding: Maximum holding period
Returns:
Series with labels (-1, 0, 1)
"""
labels = pd.Series(index=self.df.index, dtype=float)
for i in range(len(self.df) - max_holding):
entry = self.df['Close'].iloc[i]
upper = entry * (1 + profit_target)
lower = entry * (1 - stop_loss)
label = 0
for j in range(1, max_holding + 1):
if self.df['High'].iloc[i + j] >= upper:
label = 1
break
if self.df['Low'].iloc[i + j] <= lower:
label = -1
break
labels.iloc[i] = label
self.targets[name] = labels
return labels
def create_threshold_target(
self,
name: str,
horizon: int = 5,
threshold: float = 0.02
) -> pd.Series:
"""
Create threshold-based target.
Only labels significant moves.
Args:
name: Name for this target
horizon: Days ahead
threshold: Minimum move to label
Returns:
Series with labels (-1, 0, 1)
"""
future_return = self.df['Close'].pct_change(horizon).shift(-horizon)
target = pd.Series(0, index=self.df.index)
target[future_return > threshold] = 1
target[future_return < -threshold] = -1
self.targets[name] = target
return target
def create_meta_target(
self,
name: str,
primary_signal: pd.Series,
holding_period: int = 5
) -> pd.Series:
"""
Create meta-labels for a primary signal.
Args:
name: Name for this target
primary_signal: Primary model signals (1, -1, 0)
holding_period: Period to evaluate profit
Returns:
Series with meta-labels (1 = profitable, 0 = not)
"""
forward_return = self.df['Close'].pct_change(holding_period).shift(-holding_period)
profit = primary_signal * forward_return
meta_labels = (profit > 0).astype(int)
meta_labels = meta_labels.where(primary_signal != 0)
self.targets[name] = meta_labels
return meta_labels
def analyze_target(self, name: str) -> Dict:
"""
Analyze quality of a target.
Args:
name: Target name to analyze
Returns:
Dictionary with quality metrics
"""
if name not in self.targets:
raise ValueError(f"Target '{name}' not found")
target = self.targets[name].dropna()
metrics = {
'total_samples': len(target),
'class_distribution': target.value_counts().to_dict(),
'class_percentages': (target.value_counts(normalize=True) * 100).round(2).to_dict(),
'unique_values': target.nunique(),
'date_range': (str(target.index[0].date()), str(target.index[-1].date()))
}
# Calculate imbalance
counts = target.value_counts()
metrics['imbalance_ratio'] = counts.max() / counts.min()
self.quality_metrics[name] = metrics
return metrics
def compute_sample_weights(self, name: str, holding_period: int = 1) -> pd.Series:
"""
Compute sample weights based on uniqueness.
Args:
name: Target name
holding_period: Period for overlap calculation
Returns:
Series with sample weights
"""
target = self.targets[name].dropna()
weights = pd.Series(1.0, index=target.index)
for date in target.index:
start = date - pd.Timedelta(days=holding_period)
end = date + pd.Timedelta(days=holding_period)
concurrent = target[(target.index >= start) & (target.index <= end)]
weights.loc[date] = 1 / len(concurrent)
# Normalize
weights = weights / weights.sum() * len(weights)
return weights
def get_target(self, name: str, dropna: bool = True) -> pd.Series:
"""
Get a target by name.
Args:
name: Target name
dropna: Whether to drop NaN values
Returns:
Target Series
"""
if name not in self.targets:
raise ValueError(f"Target '{name}' not found")
return self.targets[name].dropna() if dropna else self.targets[name]
def list_targets(self) -> List[str]:
"""List all created targets."""
return list(self.targets.keys())
def get_summary(self) -> str:
"""
Get summary of all targets.
Returns:
Formatted summary string
"""
summary = ["Target Engineering Summary", "="*50]
if not self.targets:
summary.append("No targets created yet.")
return "\n".join(summary)
for name in self.targets:
metrics = self.analyze_target(name)
summary.append(f"\n{name}:")
summary.append(f" Samples: {metrics['total_samples']}")
summary.append(f" Classes: {metrics['class_distribution']}")
summary.append(f" Imbalance: {metrics['imbalance_ratio']:.2f}")
return "\n".join(summary)
# Demo the target engineering system
print("Target Engineering System Demo")
print("="*60)
# Initialize
engineer = TargetEngineer(df)
# Create different targets
engineer.create_direction_target('direction_1d', horizon=1)
engineer.create_direction_target('direction_5d', horizon=5)
engineer.create_triple_barrier_target('triple_barrier', profit_target=0.02, stop_loss=0.02)
engineer.create_threshold_target('threshold', horizon=5, threshold=0.02)
# Create meta-label
primary_signal = pd.Series(
np.where(df['Close'] > df['Close'].rolling(20).mean(), 1, -1),
index=df.index
)
engineer.create_meta_target('meta', primary_signal, holding_period=5)
# Print summary
print(engineer.get_summary())
# Get sample weights
weights = engineer.compute_sample_weights('triple_barrier', holding_period=5)
print(f"\nSample weights computed for triple_barrier")
print(f" Mean weight: {weights.mean():.3f}")
print(f" Weight range: {weights.min():.3f} - {weights.max():.3f}")
Key Takeaways
- Target Choice Matters: The prediction target determines what the model learns; choose carefully
- Triple Barrier Method: More realistic than simple direction labels; mirrors actual trading exits
- Meta-Labeling: Separates signal generation from signal filtering; improves precision
- Lookahead Bias: The #1 killer of backtests; always verify features don't use future information
- Sample Weights: Account for overlapping labels to prevent overweighting similar samples
- Class Balance: Trading labels are often imbalanced; monitor and address this
Next: Module 5 - Tree-Based Models
Learn how to build powerful prediction models using decision trees, random forests, and gradient boosting.
Module 5: Tree-Based Models
Part 2: Classification Models
| Duration | Exercises | Prerequisites |
|---|---|---|
| ~2.5 hours | 6 | Modules 1-4 |
Learning Objectives
By the end of this module, you will be able to:
- Understand decision tree fundamentals and splitting criteria
- Build and tune Random Forest classifiers for trading signals
- Apply XGBoost and LightGBM for enhanced performance
- Handle feature importance and model interpretation
- Apply ensemble methods for robust predictions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')
# Scikit-learn
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Gradient boosting libraries
try:
import xgboost as xgb
HAS_XGBOOST = True
except ImportError:
HAS_XGBOOST = False
print("XGBoost not installed. Install with: pip install xgboost")
try:
import lightgbm as lgb
HAS_LIGHTGBM = True
except ImportError:
HAS_LIGHTGBM = False
print("LightGBM not installed. Install with: pip install lightgbm")
import yfinance as yf
print("Module 5: Tree-Based Models")
print("=" * 40)
Section 1: Decision Tree Fundamentals
Decision trees are the foundation of many powerful ensemble methods. They make predictions by recursively partitioning the feature space.
# Decision Tree Concepts
tree_concepts = """
DECISION TREE STRUCTURE
=======================
[Root Node]
RSI > 70?
/ \\
Yes No
/ \\
[Internal] [Internal]
Vol > 0.02? MACD > 0?
/ \\ / \\
[Leaf] [Leaf] [Leaf] [Leaf]
SELL HOLD BUY HOLD
KEY CONCEPTS:
-------------
1. Splitting Criteria
- Gini Impurity: How often a randomly chosen element would be incorrectly labeled
- Entropy: Measure of randomness/disorder in the data
- Information Gain: Reduction in entropy after a split
2. Tree Parameters
- max_depth: How deep the tree can grow
- min_samples_split: Minimum samples to create a split
- min_samples_leaf: Minimum samples required at leaf node
3. Advantages for Finance
- Interpretable: Can explain why a prediction was made
- Non-linear: Captures complex relationships
- Feature importance: Ranks feature usefulness
4. Disadvantages
- Prone to overfitting
- High variance (small data changes → different tree)
- Greedy algorithm (locally optimal, not globally)
"""
print(tree_concepts)
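The splitting criteria above can be computed directly. Here is a minimal, illustrative sketch of Gini impurity and the impurity reduction (gain) from a split -- not how scikit-learn implements it internally, but term-for-term what the tree optimizes:

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity: chance a randomly drawn element is mislabeled."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(labels: np.ndarray, left_mask: np.ndarray) -> float:
    """Impurity reduction from splitting labels by a boolean mask."""
    parent = gini(labels)
    left, right = labels[left_mask], labels[~left_mask]
    w_left = len(left) / len(labels)
    child = w_left * gini(left) + (1 - w_left) * gini(right)
    return parent - child

# Toy example: does an "RSI > 70" split separate up (1) from down (0) days?
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
rsi = np.array([75, 80, 72, 65, 60, 55, 71, 50])
print(f"Parent Gini: {gini(y):.3f}")                      # 0.500 (50/50 classes)
print(f"Gain from RSI > 70 split: {split_gain(y, rsi > 70):.3f}")  # 0.500 (perfect split)
```

The tree greedily picks, at each node, the feature/threshold pair with the largest such gain.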
# Prepare data for tree models
def prepare_trading_data(symbol: str = "SPY", period: str = "2y") -> Tuple[pd.DataFrame, pd.Series]:
"""Prepare data with features and target for classification."""
# Fetch data
ticker = yf.Ticker(symbol)
df = ticker.history(period=period)
# Create features
df['returns'] = df['Close'].pct_change()
df['volatility'] = df['returns'].rolling(20).std()
df['momentum_5'] = df['Close'].pct_change(5)
df['momentum_20'] = df['Close'].pct_change(20)
# Moving averages
df['sma_5'] = df['Close'].rolling(5).mean()
df['sma_20'] = df['Close'].rolling(20).mean()
df['sma_50'] = df['Close'].rolling(50).mean()
# Distance from MAs
df['dist_sma5'] = (df['Close'] - df['sma_5']) / df['sma_5']
df['dist_sma20'] = (df['Close'] - df['sma_20']) / df['sma_20']
df['dist_sma50'] = (df['Close'] - df['sma_50']) / df['sma_50']
# RSI (simple rolling-mean variant; Wilder's original uses exponential smoothing)
delta = df['Close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
rs = gain / loss
df['rsi'] = 100 - (100 / (1 + rs))
# Volume features
df['volume_ma'] = df['Volume'].rolling(20).mean()
df['volume_ratio'] = df['Volume'] / df['volume_ma']
# Target: next day direction
df['target'] = (df['returns'].shift(-1) > 0).astype(int)
# Clean
df = df.dropna()
features = ['volatility', 'momentum_5', 'momentum_20', 'dist_sma5',
'dist_sma20', 'dist_sma50', 'rsi', 'volume_ratio']
X = df[features]
y = df['target']
return X, y
# Prepare data
X, y = prepare_trading_data()
print(f"Dataset: {len(X)} samples, {len(X.columns)} features")
print(f"Target distribution: {y.value_counts().to_dict()}")
# Build a simple decision tree
# Time series split
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
# Train decision tree with limited depth
dt_model = DecisionTreeClassifier(
max_depth=3, # Shallow tree to avoid overfitting
min_samples_split=20,
min_samples_leaf=10,
random_state=42
)
dt_model.fit(X_train, y_train)
# Evaluate
train_acc = dt_model.score(X_train, y_train)
test_acc = dt_model.score(X_test, y_test)
print(f"Decision Tree Results:")
print(f" Train Accuracy: {train_acc:.2%}")
print(f" Test Accuracy: {test_acc:.2%}")
print(f" Overfit Gap: {train_acc - test_acc:.2%}")
# Visualize the decision tree
plt.figure(figsize=(20, 10))
plot_tree(
dt_model,
feature_names=X.columns.tolist(),
class_names=['Down', 'Up'],
filled=True,
rounded=True,
fontsize=10
)
plt.title('Decision Tree for Trading Signal Classification')
plt.tight_layout()
plt.show()
# Feature importance from decision tree
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=True)
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'])
plt.xlabel('Feature Importance')
plt.title('Decision Tree Feature Importance')
plt.tight_layout()
plt.show()
print("\nFeature Importance Ranking:")
for _, row in importance_df.iloc[::-1].iterrows():
print(f" {row['feature']:15s}: {row['importance']:.4f}")
Section 2: Random Forest
Random Forest combines multiple decision trees using bagging and feature randomization to reduce overfitting and variance.
# Random Forest Concepts
rf_concepts = """
RANDOM FOREST
=============
Key Ideas:
----------
1. Bootstrap Aggregating (Bagging)
- Train each tree on a random sample of data (with replacement)
- Reduces variance without increasing bias
2. Feature Randomization
- Each split considers only a random subset of features
- Decorrelates trees, improving ensemble performance
3. Aggregation
- Classification: Majority vote across all trees
- Regression: Average of all tree predictions
Parameters:
-----------
- n_estimators: Number of trees (more is usually better, diminishing returns)
- max_features: Features to consider at each split ('sqrt', 'log2', or int)
- max_depth: Maximum tree depth (None = fully grown)
- min_samples_split: Minimum samples to make a split
- min_samples_leaf: Minimum samples at leaf nodes
Advantages:
-----------
+ Robust to overfitting (compared to single tree)
+ Handles missing values well
+ Provides feature importance
+ Out-of-bag (OOB) error estimate
+ Parallelizable
Disadvantages:
--------------
- Less interpretable than single tree
- Memory intensive (stores all trees)
- Slower prediction than single tree
"""
print(rf_concepts)
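The bagging and feature-randomization ideas above can be sketched by hand: bootstrap-sample the rows, fit one shallow tree per sample on a random feature subset, and majority-vote. This is a toy illustration on synthetic data (RandomForestClassifier automates all of it, plus OOB scoring):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic dataset: label is 1 when the first feature is positive (plus noise)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

trees = []
for _ in range(25):
    # Bagging: sample rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Feature randomization: sqrt(n_features) candidates per split
    tree = DecisionTreeClassifier(max_depth=3, max_features='sqrt', random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregation: majority vote across trees
votes = np.mean([t.predict(X) for t in trees], axis=0)
ensemble_pred = (votes > 0.5).astype(int)
print(f"Ensemble accuracy (in-sample): {(ensemble_pred == y).mean():.2%}")
```

Each individual tree is noisy; averaging their (decorrelated) votes is what drives the variance down.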
# Build a Random Forest classifier
rf_model = RandomForestClassifier(
n_estimators=100, # Number of trees
max_depth=5, # Limit depth to prevent overfitting
min_samples_split=20,
min_samples_leaf=10,
max_features='sqrt', # sqrt(n_features) at each split
oob_score=True, # Calculate out-of-bag score
random_state=42,
n_jobs=-1 # Use all cores
)
rf_model.fit(X_train, y_train)
# Evaluate
train_acc = rf_model.score(X_train, y_train)
test_acc = rf_model.score(X_test, y_test)
oob_acc = rf_model.oob_score_
print(f"Random Forest Results:")
print(f" Train Accuracy: {train_acc:.2%}")
print(f" Test Accuracy: {test_acc:.2%}")
print(f" OOB Accuracy: {oob_acc:.2%}")
print(f" Overfit Gap: {train_acc - test_acc:.2%}")
# Compare single tree vs Random Forest
print("\nComparison:")
print(f"{'Model':<20} {'Train Acc':<12} {'Test Acc':<12} {'Gap':<10}")
print("-" * 54)
dt_train = dt_model.score(X_train, y_train)
dt_test = dt_model.score(X_test, y_test)
print(f"{'Decision Tree':<20} {dt_train:<12.2%} {dt_test:<12.2%} {dt_train-dt_test:<10.2%}")
rf_train = rf_model.score(X_train, y_train)
rf_test = rf_model.score(X_test, y_test)
print(f"{'Random Forest':<20} {rf_train:<12.2%} {rf_test:<12.2%} {rf_train-rf_test:<10.2%}")
# Random Forest feature importance
rf_importance = pd.DataFrame({
'feature': X.columns,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=True)
plt.figure(figsize=(10, 6))
plt.barh(rf_importance['feature'], rf_importance['importance'], color='forestgreen')
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()
# Exercise 5.1: Random Forest Tuner (Guided)
def tune_random_forest(X_train: pd.DataFrame, y_train: pd.Series,
n_estimators_list: List[int] = [50, 100, 200],
max_depth_list: List[int] = [3, 5, 7]) -> Dict:
"""
Tune Random Forest hyperparameters using time series cross-validation.
Returns:
Dictionary with best parameters and scores
"""
# TODO: Create time series cross-validator with 5 splits
tscv = ______(n_splits=______)
best_score = -1
best_params = {}
results = []
for n_est in n_estimators_list:
for depth in max_depth_list:
# TODO: Create Random Forest with current parameters
model = ______(
n_estimators=______,
max_depth=______,
min_samples_leaf=10,
random_state=42,
n_jobs=-1
)
# TODO: Get cross-validation scores
scores = ______(model, X_train, y_train, cv=tscv, scoring='accuracy')
mean_score = scores.______()
results.append({
'n_estimators': n_est,
'max_depth': depth,
'mean_cv_score': mean_score,
'std_cv_score': scores.std()
})
if mean_score > best_score:
best_score = mean_score
best_params = {'n_estimators': n_est, 'max_depth': depth}
return {
'best_params': best_params,
'best_score': best_score,
'all_results': pd.DataFrame(results)
}
# Test the function
# tuning_results = tune_random_forest(X_train, y_train)
Solution 5.1
def tune_random_forest(X_train: pd.DataFrame, y_train: pd.Series,
n_estimators_list: List[int] = [50, 100, 200],
max_depth_list: List[int] = [3, 5, 7]) -> Dict:
"""
Tune Random Forest hyperparameters using time series cross-validation.
"""
tscv = TimeSeriesSplit(n_splits=5)
best_score = -1
best_params = {}
results = []
for n_est in n_estimators_list:
for depth in max_depth_list:
model = RandomForestClassifier(
n_estimators=n_est,
max_depth=depth,
min_samples_leaf=10,
random_state=42,
n_jobs=-1
)
scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='accuracy')
mean_score = scores.mean()
results.append({
'n_estimators': n_est,
'max_depth': depth,
'mean_cv_score': mean_score,
'std_cv_score': scores.std()
})
if mean_score > best_score:
best_score = mean_score
best_params = {'n_estimators': n_est, 'max_depth': depth}
return {
'best_params': best_params,
'best_score': best_score,
'all_results': pd.DataFrame(results)
}
Section 3: Gradient Boosting (XGBoost)
Gradient Boosting builds trees sequentially, with each tree correcting the errors of the previous ones.
# Gradient Boosting Concepts
boosting_concepts = """
GRADIENT BOOSTING
=================
Key Idea:
---------
Build trees sequentially, where each tree learns from the errors of all previous trees.
Process:
--------
1. Start with initial prediction (e.g., mean)
2. Calculate residuals (errors)
3. Fit a tree to predict the residuals
4. Update predictions by adding tree * learning_rate
5. Repeat steps 2-4 for n_estimators iterations
XGBoost Advantages:
-------------------
- Regularization: L1 (lasso) and L2 (ridge) to prevent overfitting
- Parallel processing: Faster training
- Built-in cross-validation
- Handles missing values
- Tree pruning: Removes non-essential branches
Key Parameters:
---------------
- n_estimators: Number of boosting rounds
- learning_rate (eta): Step size shrinkage (0.01-0.3 typical)
- max_depth: Maximum tree depth (3-10 typical)
- subsample: Fraction of samples for each tree
- colsample_bytree: Fraction of features for each tree
- reg_alpha: L1 regularization
- reg_lambda: L2 regularization
Random Forest vs XGBoost:
-------------------------
| Aspect | Random Forest | XGBoost |
|---------------|--------------------|--------------------|
| Training | Parallel (bagging) | Sequential (boost) |
| Trees | Deep, independent | Shallow, dependent |
| Overfitting | Less prone | More prone |
| Tuning | Easier | More parameters |
| Performance | Good baseline | Often better |
"""
print(boosting_concepts)
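The sequential residual-fitting loop above can be written out explicitly for regression. A bare-bones sketch on synthetic data -- real libraries like XGBoost add regularization, second-order gradients, and pruning on top of this core idea:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

learning_rate = 0.1
pred = np.full_like(y, y.mean())              # Step 1: initial prediction (the mean)
trees = []
for _ in range(100):
    residuals = y - pred                      # Step 2: errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                    # Step 3: fit a shallow tree to the residuals
    pred += learning_rate * tree.predict(X)   # Step 4: shrunken additive update
    trees.append(tree)

mse = np.mean((y - pred) ** 2)
print(f"Training MSE after 100 rounds: {mse:.4f}")
```

Because each tree only corrects what earlier trees got wrong, the learning rate controls how aggressively errors are chased -- which is exactly why boosting overfits more readily than bagging.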
# Scikit-learn's GradientBoostingClassifier
gbc_model = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
min_samples_split=20,
min_samples_leaf=10,
subsample=0.8,
random_state=42
)
gbc_model.fit(X_train, y_train)
train_acc = gbc_model.score(X_train, y_train)
test_acc = gbc_model.score(X_test, y_test)
print(f"Gradient Boosting (sklearn) Results:")
print(f" Train Accuracy: {train_acc:.2%}")
print(f" Test Accuracy: {test_acc:.2%}")
# XGBoost implementation
if HAS_XGBOOST:
xgb_model = xgb.XGBClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
min_child_weight=10,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1, # L1 regularization
reg_lambda=1.0, # L2 regularization
random_state=42,
eval_metric='logloss'  # note: use_label_encoder was deprecated and removed in XGBoost 2.0
)
xgb_model.fit(X_train, y_train)
train_acc = xgb_model.score(X_train, y_train)
test_acc = xgb_model.score(X_test, y_test)
print(f"XGBoost Results:")
print(f" Train Accuracy: {train_acc:.2%}")
print(f" Test Accuracy: {test_acc:.2%}")
else:
print("XGBoost not available. Install with: pip install xgboost")
# XGBoost with early stopping
if HAS_XGBOOST:
# Create validation set for early stopping
val_idx = int(len(X_train) * 0.8)
X_train_sub, X_val = X_train[:val_idx], X_train[val_idx:]
y_train_sub, y_val = y_train[:val_idx], y_train[val_idx:]
xgb_early = xgb.XGBClassifier(
n_estimators=500, # High number, will stop early
learning_rate=0.05,
max_depth=3,
min_child_weight=10,
subsample=0.8,
colsample_bytree=0.8,
random_state=42,
eval_metric='logloss',
early_stopping_rounds=20
)
xgb_early.fit(
X_train_sub, y_train_sub,
eval_set=[(X_val, y_val)],
verbose=False
)
print(f"Early Stopping Results:")
print(f" Best iteration: {xgb_early.best_iteration}")
print(f" Test Accuracy: {xgb_early.score(X_test, y_test):.2%}")
# Exercise 5.2: XGBoost Feature Importance Analyzer (Guided)
def analyze_xgb_importance(model, feature_names: List[str],
importance_type: str = 'gain') -> pd.DataFrame:
"""
Analyze XGBoost feature importance using different metrics.
importance_type: 'weight', 'gain', or 'cover'
- weight: Number of times feature is used in trees
- gain: Average improvement in accuracy when feature is used
- cover: Average number of samples affected by feature splits
"""
# TODO: Get importance scores from the model's booster
importance_dict = model.get_booster().______(importance_type=______)
# Create dataframe with feature names and importance
importance_df = pd.DataFrame([
{'feature': f, 'importance': importance_dict.get(f, 0)}
for f in feature_names
])
# TODO: Sort by importance descending
importance_df = importance_df.______(______, ascending=______)
# Normalize to percentages
total = importance_df['importance'].sum()
if total > 0:
importance_df['pct'] = importance_df['importance'] / total * 100
return importance_df
# Test the function
# if HAS_XGBOOST:
# importance = analyze_xgb_importance(xgb_model, X.columns.tolist())
Solution 5.2
def analyze_xgb_importance(model, feature_names: List[str],
importance_type: str = 'gain') -> pd.DataFrame:
"""
Analyze XGBoost feature importance using different metrics.
"""
importance_dict = model.get_booster().get_score(importance_type=importance_type)
importance_df = pd.DataFrame([
{'feature': f, 'importance': importance_dict.get(f, 0)}
for f in feature_names
])
importance_df = importance_df.sort_values('importance', ascending=False)
total = importance_df['importance'].sum()
if total > 0:
importance_df['pct'] = importance_df['importance'] / total * 100
return importance_df
Section 4: LightGBM
LightGBM is a highly efficient gradient boosting implementation that uses histogram-based learning.
# LightGBM Concepts
lgbm_concepts = """
LIGHTGBM
========
Key Innovations:
----------------
1. Histogram-based Learning
- Bins continuous features into discrete buckets
- Much faster than exact split finding
- Reduces memory usage
2. Leaf-wise Tree Growth
- Grows tree by splitting leaf with max gain
- More complex trees, better accuracy
- Prone to overfitting (use max_depth limit)
3. Gradient-based One-Side Sampling (GOSS)
- Keeps samples with large gradients
- Randomly samples from small gradients
- Faster training with minimal accuracy loss
XGBoost vs LightGBM:
--------------------
| Aspect | XGBoost | LightGBM |
|---------------|------------------|-------------------|
| Tree Growth | Level-wise | Leaf-wise |
| Speed | Good | Faster |
| Memory | Higher | Lower |
| Categoricals | Needs encoding | Native support |
| Overfitting | Less prone | More prone |
Key Parameters:
---------------
- num_leaves: Max leaves per tree (default 31)
- max_depth: Limit tree depth (-1 = unlimited)
- learning_rate: Step size (0.01-0.3)
- feature_fraction: Features per tree (like colsample_bytree)
- bagging_fraction: Samples per tree (like subsample)
"""
print(lgbm_concepts)
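The histogram idea is easy to illustrate directly with NumPy: instead of evaluating every unique feature value as a split candidate, bin the feature once and only consider bin boundaries. A minimal sketch of the concept (illustrative only; LightGBM's internal binning is more sophisticated):

```python
import numpy as np

rng = np.random.default_rng(42)
feature = rng.normal(size=10_000)  # continuous feature, ~10k unique values

# Exact split finding would test ~10k candidate thresholds;
# histogram-based learning bins the feature first (LightGBM defaults to 255 bins)
n_bins = 255
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
binned = np.digitize(feature, edges[1:-1])  # integer bin index per sample

print(f"Unique raw values:    {len(np.unique(feature))}")
print(f"Split candidates now: {len(np.unique(binned))}")
```

Split gains are then computed per bin rather than per unique value, which is where both the speed and the memory savings come from.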
# LightGBM implementation
if HAS_LIGHTGBM:
lgb_model = lgb.LGBMClassifier(
n_estimators=100,
learning_rate=0.1,
num_leaves=31,
max_depth=5,
min_child_samples=20,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1,
reg_lambda=1.0,
random_state=42,
verbose=-1
)
lgb_model.fit(X_train, y_train)
train_acc = lgb_model.score(X_train, y_train)
test_acc = lgb_model.score(X_test, y_test)
print(f"LightGBM Results:")
print(f" Train Accuracy: {train_acc:.2%}")
print(f" Test Accuracy: {test_acc:.2%}")
else:
print("LightGBM not available. Install with: pip install lightgbm")
# Compare all tree-based models
print("\n" + "="*60)
print("TREE-BASED MODEL COMPARISON")
print("="*60)
models = {
'Decision Tree': dt_model,
'Random Forest': rf_model,
'GradientBoosting': gbc_model
}
if HAS_XGBOOST:
models['XGBoost'] = xgb_model
if HAS_LIGHTGBM:
models['LightGBM'] = lgb_model
print(f"\n{'Model':<20} {'Train Acc':<12} {'Test Acc':<12} {'Gap':<10}")
print("-" * 54)
for name, model in models.items():
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"{name:<20} {train_acc:<12.2%} {test_acc:<12.2%} {train_acc-test_acc:<10.2%}")
# Exercise 5.3: LightGBM Trainer with Callbacks (Guided)
def train_lgb_with_callbacks(X_train: pd.DataFrame, y_train: pd.Series,
X_val: pd.DataFrame, y_val: pd.Series,
params: Dict = None) -> Tuple:
"""
Train LightGBM with early stopping and logging.
"""
if not HAS_LIGHTGBM:
raise ImportError("LightGBM not installed")
default_params = {
'n_estimators': 500,
'learning_rate': 0.05,
'num_leaves': 31,
'max_depth': 5,
'random_state': 42,
'verbose': -1
}
if params:
default_params.update(params)
# TODO: Create LightGBM classifier with parameters
model = lgb.______(**default_params)
# TODO: Fit with evaluation set and early stopping
model.______(X_train, y_train,
eval_set=[(______, ______)],
callbacks=[lgb.early_stopping(stopping_rounds=20, verbose=False)])
# Get best iteration
best_iter = model.best_iteration_
return model, best_iter
# Test the function
# model, best_iter = train_lgb_with_callbacks(X_train_sub, y_train_sub, X_val, y_val)
Solution 5.3
def train_lgb_with_callbacks(X_train: pd.DataFrame, y_train: pd.Series,
X_val: pd.DataFrame, y_val: pd.Series,
params: Dict = None) -> Tuple:
"""
Train LightGBM with early stopping and logging.
"""
if not HAS_LIGHTGBM:
raise ImportError("LightGBM not installed")
default_params = {
'n_estimators': 500,
'learning_rate': 0.05,
'num_leaves': 31,
'max_depth': 5,
'random_state': 42,
'verbose': -1
}
if params:
default_params.update(params)
model = lgb.LGBMClassifier(**default_params)
model.fit(X_train, y_train,
eval_set=[(X_val, y_val)],
callbacks=[lgb.early_stopping(stopping_rounds=20, verbose=False)])
best_iter = model.best_iteration_
return model, best_iter
Section 5: Ensemble Methods
Combining multiple models can produce more robust predictions than any single model.
# Ensemble Methods Overview
ensemble_concepts = """
ENSEMBLE METHODS
================
1. VOTING ENSEMBLES
- Hard Voting: Majority vote from all models
- Soft Voting: Average probabilities, then predict
2. STACKING
- Train base models on data
- Use base model predictions as features for meta-model
- Meta-model learns to combine predictions optimally
3. BLENDING
- Similar to stacking but uses holdout set
- Base models trained on training set
- Meta-model trained on holdout predictions
Why Ensembles Work:
-------------------
- Different models capture different patterns
- Errors from different models tend to cancel out
- More robust to overfitting
- Reduced variance in predictions
Best Practices:
---------------
- Use diverse base models (different algorithms)
- Each model should be better than random
- Models should make different types of errors
- Keep the ensemble small; each added model increases cost with diminishing returns
"""
print(ensemble_concepts)
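The claim that errors tend to cancel out can be made concrete with a little arithmetic. If three classifiers are each right 55% of the time and their errors are independent (a strong assumption that real models only approximate), a majority vote is right whenever at least two of the three are:

```python
from math import comb

p = 0.55  # per-model accuracy
# Majority of 3 is correct if exactly 2 or all 3 models are correct
p_majority = sum(comb(3, k) * p**k * (1 - p)**(3 - k) for k in (2, 3))
print(f"Single model accuracy: {p:.1%}")
print(f"3-model majority vote: {p_majority:.3%}")
```

Even a small per-model edge compounds under voting; correlated errors shrink this benefit, which is why diverse base models matter.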
# Voting Ensemble
from sklearn.ensemble import VotingClassifier
# Create base estimators
estimators = [
('rf', RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)),
('gbc', GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42))
]
# Add XGBoost and LightGBM if available
if HAS_XGBOOST:
estimators.append(('xgb', xgb.XGBClassifier(
n_estimators=100, max_depth=3, random_state=42,
use_label_encoder=False, eval_metric='logloss'
)))
if HAS_LIGHTGBM:
estimators.append(('lgb', lgb.LGBMClassifier(
n_estimators=100, max_depth=5, random_state=42, verbose=-1
)))
# Create voting classifier
voting_clf = VotingClassifier(
estimators=estimators,
voting='soft' # Use predicted probabilities
)
voting_clf.fit(X_train, y_train)
train_acc = voting_clf.score(X_train, y_train)
test_acc = voting_clf.score(X_test, y_test)
print(f"Voting Ensemble Results:")
print(f" Train Accuracy: {train_acc:.2%}")
print(f" Test Accuracy: {test_acc:.2%}")
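Blending, listed above but not yet implemented, is worth a quick sketch. A minimal version on synthetic data (a stand-in for the features used in this module; the model choices here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the feature matrix used above
X_all, y_all = make_classification(n_samples=1000, n_features=8, random_state=42)

# Chronological-style split: train / holdout (for the meta-model) / test
X_tr, y_tr = X_all[:600], y_all[:600]
X_hold, y_hold = X_all[600:800], y_all[600:800]
X_te, y_te = X_all[800:], y_all[800:]

base_models = [
    RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42),
    GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
]

# 1) Base models train on the training set only
for m in base_models:
    m.fit(X_tr, y_tr)

# 2) Their holdout probabilities become the meta-model's features
hold_preds = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in base_models])
meta = LogisticRegression().fit(hold_preds, y_hold)

# 3) At test time, stack base predictions and let the meta-model combine them
test_preds = np.column_stack([m.predict_proba(X_te)[:, 1] for m in base_models])
print(f"Blended test accuracy: {meta.score(test_preds, y_te):.2%}")
```

Unlike stacking, the meta-model never sees out-of-fold predictions; it only sees the holdout set, which is simpler but wastes some training data.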
# Custom weighted ensemble
class WeightedEnsemble:
"""Custom weighted ensemble classifier."""
def __init__(self, models: List, weights: List[float] = None):
self.models = models
self.weights = weights or [1/len(models)] * len(models)
def fit(self, X: pd.DataFrame, y: pd.Series):
"""Fit all base models."""
for model in self.models:
model.fit(X, y)
return self
def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
"""Weighted average of predicted probabilities."""
probas = np.zeros((len(X), 2))
for model, weight in zip(self.models, self.weights):
probas += weight * model.predict_proba(X)
return probas / sum(self.weights)
def predict(self, X: pd.DataFrame) -> np.ndarray:
"""Predict class labels."""
probas = self.predict_proba(X)
return (probas[:, 1] > 0.5).astype(int)
def score(self, X: pd.DataFrame, y: pd.Series) -> float:
"""Calculate accuracy."""
return accuracy_score(y, self.predict(X))
# Create weighted ensemble
base_models = [
RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
]
if HAS_XGBOOST:
base_models.append(xgb.XGBClassifier(
n_estimators=100, max_depth=3, random_state=42,
use_label_encoder=False, eval_metric='logloss'
))
# Weight models (higher weight for better performers)
weights = [0.3, 0.3, 0.4] if HAS_XGBOOST else [0.5, 0.5]
weighted_ensemble = WeightedEnsemble(base_models, weights)
weighted_ensemble.fit(X_train, y_train)
print(f"\nWeighted Ensemble Results:")
print(f" Test Accuracy: {weighted_ensemble.score(X_test, y_test):.2%}")
# Exercise 5.4: Complete Tree-Based Classifier System (Open-ended)
#
# Build a TreeBasedClassifier class that:
# - Supports multiple model types: 'dt', 'rf', 'xgb', 'lgb'
# - Has a tune() method for hyperparameter optimization
# - Has a fit() method that trains the selected model
# - Has a predict() and predict_proba() method
# - Has a get_feature_importance() method returning DataFrame
# - Handles missing XGBoost/LightGBM gracefully
#
# Your implementation:
Solution 5.4
class TreeBasedClassifier:
"""Unified interface for tree-based classifiers."""
SUPPORTED_MODELS = ['dt', 'rf', 'gbc', 'xgb', 'lgb']
def __init__(self, model_type: str = 'rf', **kwargs):
if model_type not in self.SUPPORTED_MODELS:
raise ValueError(f"Unsupported model: {model_type}")
self.model_type = model_type
self.params = kwargs
self.model = None
self.feature_names = None
def _create_model(self, params: Dict):
"""Create model instance based on type."""
params = dict(params)  # copy so setdefault calls below don't mutate self.params
if self.model_type == 'dt':
return DecisionTreeClassifier(**params)
elif self.model_type == 'rf':
return RandomForestClassifier(**params)
elif self.model_type == 'gbc':
return GradientBoostingClassifier(**params)
elif self.model_type == 'xgb':
if not HAS_XGBOOST:
raise ImportError("XGBoost not installed")
params.setdefault('use_label_encoder', False)
params.setdefault('eval_metric', 'logloss')
return xgb.XGBClassifier(**params)
elif self.model_type == 'lgb':
if not HAS_LIGHTGBM:
raise ImportError("LightGBM not installed")
params.setdefault('verbose', -1)
return lgb.LGBMClassifier(**params)
def tune(self, X: pd.DataFrame, y: pd.Series,
param_grid: Dict = None, cv: int = 5) -> Dict:
"""Tune hyperparameters using cross-validation."""
from sklearn.model_selection import GridSearchCV
if param_grid is None:
# DecisionTreeClassifier has no n_estimators parameter, so only add it for ensembles
param_grid = {'max_depth': [3, 5, 7]}
if self.model_type != 'dt':
param_grid['n_estimators'] = [50, 100]
base_model = self._create_model(self.params)
tscv = TimeSeriesSplit(n_splits=cv)
grid_search = GridSearchCV(
base_model, param_grid, cv=tscv, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X, y)
self.params.update(grid_search.best_params_)
return grid_search.best_params_
def fit(self, X: pd.DataFrame, y: pd.Series):
"""Fit the model."""
self.feature_names = X.columns.tolist()
self.model = self._create_model(self.params)
self.model.fit(X, y)
return self
def predict(self, X: pd.DataFrame) -> np.ndarray:
"""Predict class labels."""
return self.model.predict(X)
def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
"""Predict class probabilities."""
return self.model.predict_proba(X)
def get_feature_importance(self) -> pd.DataFrame:
"""Get feature importance as DataFrame."""
importance = self.model.feature_importances_
df = pd.DataFrame({
'feature': self.feature_names,
'importance': importance
}).sort_values('importance', ascending=False)
df['pct'] = df['importance'] / df['importance'].sum() * 100
return df
def score(self, X: pd.DataFrame, y: pd.Series) -> float:
"""Calculate accuracy."""
return accuracy_score(y, self.predict(X))
# Exercise 5.5: Stacking Ensemble Builder (Open-ended)
#
# Build a StackingEnsemble class that:
# - Takes a list of base models and a meta-model
# - Uses cross-validation to generate base model predictions
# - Trains meta-model on stacked predictions
# - Implements fit(), predict(), and predict_proba()
# - Returns individual model contributions
#
# Your implementation:
Solution 5.5
from sklearn.base import clone
class StackingEnsemble:
"""Stacking ensemble with customizable base and meta models."""
def __init__(self, base_models: List, meta_model, n_folds: int = 5):
self.base_models = [clone(m) for m in base_models]
self.meta_model = clone(meta_model)
self.n_folds = n_folds
self.fitted_base_models = []
def fit(self, X: pd.DataFrame, y: pd.Series):
"""Fit stacking ensemble."""
X_arr = X.values if isinstance(X, pd.DataFrame) else X
y_arr = y.values if isinstance(y, pd.Series) else y
n_samples = len(X_arr)
n_models = len(self.base_models)
# Out-of-fold predictions for meta features
meta_features = np.zeros((n_samples, n_models))
tscv = TimeSeriesSplit(n_splits=self.n_folds)
for model_idx, model in enumerate(self.base_models):
for train_idx, val_idx in tscv.split(X_arr):
cloned = clone(model)
cloned.fit(X_arr[train_idx], y_arr[train_idx])
# Store probability for positive class
meta_features[val_idx, model_idx] = cloned.predict_proba(X_arr[val_idx])[:, 1]
# Train meta-model on stacked predictions
# Use only samples that have meta predictions (after first fold)
mask = meta_features.sum(axis=1) != 0
self.meta_model.fit(meta_features[mask], y_arr[mask])
# Refit base models on full data
self.fitted_base_models = []
for model in self.base_models:
fitted = clone(model)
fitted.fit(X_arr, y_arr)
self.fitted_base_models.append(fitted)
return self
def _get_meta_features(self, X: pd.DataFrame) -> np.ndarray:
"""Generate meta features from base model predictions."""
X_arr = X.values if isinstance(X, pd.DataFrame) else X
meta_features = np.column_stack([
model.predict_proba(X_arr)[:, 1]
for model in self.fitted_base_models
])
return meta_features
def predict(self, X: pd.DataFrame) -> np.ndarray:
"""Predict class labels."""
meta_features = self._get_meta_features(X)
return self.meta_model.predict(meta_features)
def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
"""Predict class probabilities."""
meta_features = self._get_meta_features(X)
return self.meta_model.predict_proba(meta_features)
def get_base_contributions(self, X: pd.DataFrame) -> pd.DataFrame:
"""Get predictions from each base model."""
return pd.DataFrame(
self._get_meta_features(X),
columns=[f'model_{i}' for i in range(len(self.fitted_base_models))]
)
def score(self, X: pd.DataFrame, y: pd.Series) -> float:
"""Calculate accuracy."""
return accuracy_score(y, self.predict(X))
# Exercise 5.6: Model Selection Framework (Open-ended)
#
# Build a TreeModelSelector class that:
# - Automatically trains and compares multiple tree-based models
# - Uses proper time series cross-validation
# - Tracks training time, accuracy, and feature importance
# - Generates a comparison report
# - Recommends the best model based on test performance
# - Provides a plot comparing all models
#
# Your implementation:
Solution 5.6
import time
class TreeModelSelector:
"""Automated tree-based model selection."""
def __init__(self):
self.models = {}
self.results = {}
self.best_model = None
self.best_model_name = None
def _get_default_models(self) -> Dict:
"""Get default set of tree-based models."""
models = {
'DecisionTree': DecisionTreeClassifier(max_depth=5, random_state=42),
'RandomForest': RandomForestClassifier(
n_estimators=100, max_depth=5, random_state=42, n_jobs=-1
),
'GradientBoosting': GradientBoostingClassifier(
n_estimators=100, max_depth=3, random_state=42
)
}
if HAS_XGBOOST:
models['XGBoost'] = xgb.XGBClassifier(
n_estimators=100, max_depth=3, random_state=42,
use_label_encoder=False, eval_metric='logloss'
)
if HAS_LIGHTGBM:
models['LightGBM'] = lgb.LGBMClassifier(
n_estimators=100, max_depth=5, random_state=42, verbose=-1
)
return models
def fit(self, X_train: pd.DataFrame, y_train: pd.Series,
X_test: pd.DataFrame, y_test: pd.Series,
custom_models: Dict = None):
"""Train and evaluate all models."""
self.models = custom_models or self._get_default_models()
self.feature_names = X_train.columns.tolist()
for name, model in self.models.items():
print(f"Training {name}...")
start_time = time.time()
model.fit(X_train, y_train)
train_time = time.time() - start_time
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
# Get feature importance
importance = pd.DataFrame({
'feature': self.feature_names,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
self.results[name] = {
'model': model,
'train_accuracy': train_acc,
'test_accuracy': test_acc,
'overfit_gap': train_acc - test_acc,
'train_time': train_time,
'feature_importance': importance
}
# Find best model by test accuracy
self.best_model_name = max(
self.results.keys(),
key=lambda x: self.results[x]['test_accuracy']
)
self.best_model = self.results[self.best_model_name]['model']
return self
def get_comparison_report(self) -> pd.DataFrame:
"""Generate comparison DataFrame."""
rows = []
for name, result in self.results.items():
rows.append({
'Model': name,
'Train Acc': f"{result['train_accuracy']:.2%}",
'Test Acc': f"{result['test_accuracy']:.2%}",
'Overfit Gap': f"{result['overfit_gap']:.2%}",
'Train Time (s)': f"{result['train_time']:.2f}"
})
# Sort numerically: sorting the formatted percentage strings would be lexicographic
rows.sort(key=lambda r: float(r['Test Acc'].rstrip('%')), reverse=True)
return pd.DataFrame(rows)
def plot_comparison(self):
"""Plot model comparison."""
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Accuracy comparison
models = list(self.results.keys())
train_accs = [self.results[m]['train_accuracy'] for m in models]
test_accs = [self.results[m]['test_accuracy'] for m in models]
x = np.arange(len(models))
width = 0.35
axes[0].bar(x - width/2, train_accs, width, label='Train', alpha=0.8)
axes[0].bar(x + width/2, test_accs, width, label='Test', alpha=0.8)
axes[0].set_ylabel('Accuracy')
axes[0].set_xticks(x)
axes[0].set_xticklabels(models, rotation=45, ha='right')
axes[0].legend()
axes[0].set_title('Train vs Test Accuracy')
# Training time
times = [self.results[m]['train_time'] for m in models]
axes[1].bar(models, times, color='green', alpha=0.7)
axes[1].set_ylabel('Training Time (seconds)')
axes[1].set_xticks(range(len(models)))
axes[1].set_xticklabels(models, rotation=45, ha='right')
axes[1].set_title('Training Time')
plt.tight_layout()
plt.show()
def recommend(self) -> str:
"""Return recommendation string."""
result = self.results[self.best_model_name]
return (
f"Recommended: {self.best_model_name}\n"
f" Test Accuracy: {result['test_accuracy']:.2%}\n"
f" Overfit Gap: {result['overfit_gap']:.2%}\n"
f" Top Features: {', '.join(result['feature_importance']['feature'].head(3).tolist())}"
)
Module Project: Complete Tree-Based Trading Signal System
Build a comprehensive system that uses tree-based models for trading signal generation.
class TreeBasedTradingSystem:
"""
Complete trading signal system using tree-based models.
Features:
- Multiple model support (RF, XGBoost, LightGBM)
- Ensemble predictions
- Feature importance analysis
- Signal generation with confidence
"""
def __init__(self, model_type: str = 'ensemble'):
"""
Initialize trading system.
Args:
model_type: 'rf', 'xgb', 'lgb', or 'ensemble'
"""
self.model_type = model_type
self.model = None
self.feature_names = None
self.scaler = StandardScaler()
def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create trading features from OHLCV data."""
features = pd.DataFrame(index=df.index)
# Price features
features['returns'] = df['Close'].pct_change()
features['volatility'] = features['returns'].rolling(20).std()
# Momentum
for period in [5, 10, 20]:
features[f'momentum_{period}'] = df['Close'].pct_change(period)
# Moving average distances
for period in [5, 20, 50]:
ma = df['Close'].rolling(period).mean()
features[f'dist_ma{period}'] = (df['Close'] - ma) / ma
# RSI
delta = df['Close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
rs = gain / loss
features['rsi'] = 100 - (100 / (1 + rs))
# Bollinger Band position
ma20 = df['Close'].rolling(20).mean()
std20 = df['Close'].rolling(20).std()
features['bb_position'] = (df['Close'] - ma20) / (2 * std20)
# Volume features
features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
return features.dropna()
def _create_model(self):
"""Create model based on type."""
if self.model_type == 'rf':
return RandomForestClassifier(
n_estimators=100, max_depth=5, min_samples_leaf=10,
random_state=42, n_jobs=-1
)
elif self.model_type == 'xgb' and HAS_XGBOOST:
return xgb.XGBClassifier(
n_estimators=100, max_depth=3, learning_rate=0.1,
min_child_weight=10, random_state=42,
use_label_encoder=False, eval_metric='logloss'
)
elif self.model_type == 'lgb' and HAS_LIGHTGBM:
return lgb.LGBMClassifier(
n_estimators=100, max_depth=5, learning_rate=0.1,
min_child_samples=20, random_state=42, verbose=-1
)
elif self.model_type == 'ensemble':
estimators = [
('rf', RandomForestClassifier(
n_estimators=100, max_depth=5, random_state=42, n_jobs=-1
)),
('gbc', GradientBoostingClassifier(
n_estimators=100, max_depth=3, random_state=42
))
]
if HAS_XGBOOST:
estimators.append(('xgb', xgb.XGBClassifier(
n_estimators=100, max_depth=3, random_state=42,
use_label_encoder=False, eval_metric='logloss'
)))
return VotingClassifier(estimators=estimators, voting='soft')
else:
return RandomForestClassifier(
n_estimators=100, max_depth=5, random_state=42, n_jobs=-1
)
def fit(self, df: pd.DataFrame, test_size: float = 0.2):
"""
Fit the trading system.
Args:
df: OHLCV DataFrame
test_size: Fraction for testing
"""
# Create features
features = self.create_features(df)
# Align with original data and create target
aligned_df = df.loc[features.index]
target = (aligned_df['Close'].pct_change().shift(-1) > 0).astype(int)
# Remove last row (no target)
features = features[:-1]
target = target[:-1]
self.feature_names = features.columns.tolist()
# Split
split_idx = int(len(features) * (1 - test_size))
X_train = features[:split_idx]
X_test = features[split_idx:]
y_train = target[:split_idx]
y_test = target[split_idx:]
# Scale features
X_train_scaled = self.scaler.fit_transform(X_train)
X_test_scaled = self.scaler.transform(X_test)
# Train model
self.model = self._create_model()
self.model.fit(X_train_scaled, y_train)
# Evaluate
train_acc = self.model.score(X_train_scaled, y_train)
test_acc = self.model.score(X_test_scaled, y_test)
print(f"\nTraining Complete ({self.model_type})")
print(f" Train Accuracy: {train_acc:.2%}")
print(f" Test Accuracy: {test_acc:.2%}")
return self
def predict_signal(self, df: pd.DataFrame) -> pd.DataFrame:
"""
Generate trading signals with confidence.
Returns:
DataFrame with signal and confidence
"""
features = self.create_features(df)
X_scaled = self.scaler.transform(features)
# Get predictions and probabilities
predictions = self.model.predict(X_scaled)
probabilities = self.model.predict_proba(X_scaled)
# Create signals DataFrame
signals = pd.DataFrame(index=features.index)
signals['signal'] = predictions
signals['confidence'] = np.max(probabilities, axis=1)
signals['signal_name'] = signals['signal'].map({0: 'SELL', 1: 'BUY'})
return signals
def get_feature_importance(self) -> pd.DataFrame:
"""Get feature importance (works for non-ensemble models)."""
if hasattr(self.model, 'feature_importances_'):
importance = self.model.feature_importances_
elif hasattr(self.model, 'estimators_'):
# For VotingClassifier, average importance from tree-based estimators
importances = []
for name, est in self.model.named_estimators_.items():
if hasattr(est, 'feature_importances_'):
importances.append(est.feature_importances_)
importance = np.mean(importances, axis=0)
else:
return pd.DataFrame()
return pd.DataFrame({
'feature': self.feature_names,
'importance': importance
}).sort_values('importance', ascending=False)
def backtest_signals(self, df: pd.DataFrame) -> pd.DataFrame:
"""Simple backtest of trading signals."""
signals = self.predict_signal(df)
# Align returns
returns = df['Close'].pct_change().shift(-1)
aligned_returns = returns.loc[signals.index]
# Strategy returns: next_return at time t is already the forward return,
# so the signal generated at t applies to it directly (no extra shift)
signals['next_return'] = aligned_returns
signals['strategy_return'] = signals['signal'] * signals['next_return']
# Cumulative returns
signals['cumulative_strategy'] = (1 + signals['strategy_return'].fillna(0)).cumprod()
signals['cumulative_bh'] = (1 + signals['next_return'].fillna(0)).cumprod()
return signals
# Test the complete system
# Fetch data
ticker = yf.Ticker("SPY")
data = ticker.history(period="2y")
# Create and train system
trading_system = TreeBasedTradingSystem(model_type='ensemble')
trading_system.fit(data)
# Get feature importance
importance = trading_system.get_feature_importance()
print("\nTop Features:")
print(importance.head(5).to_string(index=False))
# Generate signals and backtest
backtest_results = trading_system.backtest_signals(data)
# Plot cumulative returns
plt.figure(figsize=(12, 6))
plt.plot(backtest_results['cumulative_strategy'].dropna(), label='Strategy', linewidth=2)
plt.plot(backtest_results['cumulative_bh'].dropna(), label='Buy & Hold', linewidth=2, alpha=0.7)
plt.xlabel('Date')
plt.ylabel('Cumulative Return')
plt.title('Tree-Based Trading Strategy Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate metrics
strategy_return = backtest_results['cumulative_strategy'].iloc[-1] - 1
bh_return = backtest_results['cumulative_bh'].iloc[-1] - 1
print(f"\nPerformance Summary:")
print(f" Strategy Return: {strategy_return:.2%}")
print(f" Buy & Hold Return: {bh_return:.2%}")
print(f" Outperformance: {strategy_return - bh_return:.2%}")
Key Takeaways
- Decision Trees are interpretable but prone to overfitting; limit depth and use regularization
- Random Forest reduces variance through bagging and feature randomization; a solid baseline for financial ML
- XGBoost offers regularization and speed improvements; excellent for structured data
- LightGBM is faster with histogram-based learning; watch for overfitting with leaf-wise growth
- Ensemble methods (voting, stacking) often outperform single models by combining diverse predictions
- Feature importance helps understand what drives predictions and can guide feature engineering
- Always use time series cross-validation for financial data to prevent lookahead bias
Next: Module 6 - Other Classification Models (Logistic Regression, SVM, Neural Networks)
Module 6: Other Classification Models
Part 2: Classification Models
| Duration | Exercises | Prerequisites |
|---|---|---|
| ~2.5 hours | 6 | Modules 1-5 |
Learning Objectives
By the end of this module, you will be able to:
- Apply logistic regression for probabilistic trading signals
- Use Support Vector Machines for classification
- Build neural network classifiers with sklearn and keras
- Understand when to use each model type
- Compare model performance on financial data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')
# Scikit-learn models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
import yfinance as yf
print("Module 6: Other Classification Models")
print("=" * 40)
# Prepare data for classification
def prepare_classification_data(symbol: str = "SPY", period: str = "2y") -> Tuple[pd.DataFrame, pd.Series]:
"""Prepare features and target for classification."""
ticker = yf.Ticker(symbol)
df = ticker.history(period=period)
# Features
df['returns'] = df['Close'].pct_change()
df['volatility'] = df['returns'].rolling(20).std()
df['momentum_5'] = df['Close'].pct_change(5)
df['momentum_20'] = df['Close'].pct_change(20)
# Moving averages
for window in [5, 20, 50]:  # 'window' avoids shadowing the 'period' argument
ma = df['Close'].rolling(window).mean()
df[f'dist_ma{window}'] = (df['Close'] - ma) / ma
# RSI
delta = df['Close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
rs = gain / loss
df['rsi'] = 100 - (100 / (1 + rs))
# Volume
df['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
# Target
df['target'] = (df['returns'].shift(-1) > 0).astype(int)
df = df.dropna()
features = ['volatility', 'momentum_5', 'momentum_20', 'dist_ma5',
'dist_ma20', 'dist_ma50', 'rsi', 'volume_ratio']
return df[features], df['target']
# Load data
X, y = prepare_classification_data()
print(f"Data shape: {X.shape}")
print(f"Target distribution: {y.value_counts().to_dict()}")
# Train/test split (time series)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Section 1: Logistic Regression
Logistic regression is a linear model for classification that outputs probabilities. Despite its simplicity, it often works well for financial data.
# Logistic Regression Concepts
logreg_concepts = """
LOGISTIC REGRESSION
===================
How It Works:
-------------
1. Compute linear combination: z = w0 + w1*x1 + w2*x2 + ...
2. Apply sigmoid function: P(y=1) = 1 / (1 + e^(-z))
3. Classify: y = 1 if P(y=1) > threshold else 0
Sigmoid Function:
-----------------
1.0 ─────────────────────
│ ╱
0.5 ├─────────────╳
│ ╱
0.0 ─────────────────────
-5 0 5
Key Features:
-------------
- Outputs probabilities (interpretable)
- Coefficients indicate feature importance and direction
- Can be regularized (L1/L2) to prevent overfitting
- Fast to train and predict
Regularization:
---------------
- L1 (Lasso): penalty='l1', leads to sparse coefficients
- L2 (Ridge): penalty='l2', shrinks coefficients (default)
- C parameter: Inverse of regularization strength (smaller = more regularization)
Advantages for Finance:
-----------------------
+ Probabilistic output (good for position sizing)
+ Interpretable coefficients
+ Fast and simple
+ Regularization handles multicollinearity
Limitations:
------------
- Assumes linear decision boundary
- May underfit complex patterns
- Sensitive to outliers
"""
print(logreg_concepts)
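The two steps above (linear score, then sigmoid) are exactly what `predict_proba` computes for a binary model, which is easy to verify by hand. A quick self-contained check on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Manual computation: z = w . x + b, then P(y=1) = 1 / (1 + e^(-z))
z = X_demo @ model.coef_[0] + model.intercept_[0]
p_manual = 1 / (1 + np.exp(-z))
p_sklearn = model.predict_proba(X_demo)[:, 1]

print(f"Max difference: {np.abs(p_manual - p_sklearn).max():.2e}")
```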
# Basic Logistic Regression
logreg = LogisticRegression(
penalty='l2', # L2 regularization
C=1.0, # Regularization strength
solver='lbfgs', # Optimization algorithm
max_iter=1000,
random_state=42
)
logreg.fit(X_train_scaled, y_train)
# Evaluate
train_acc = logreg.score(X_train_scaled, y_train)
test_acc = logreg.score(X_test_scaled, y_test)
print(f"Logistic Regression Results:")
print(f" Train Accuracy: {train_acc:.2%}")
print(f" Test Accuracy: {test_acc:.2%}")
# Analyze coefficients
coef_df = pd.DataFrame({
'feature': X.columns,
'coefficient': logreg.coef_[0],
'abs_coefficient': np.abs(logreg.coef_[0])
}).sort_values('abs_coefficient', ascending=False)
print("\nFeature Coefficients:")
print("(Positive = increases probability of UP, Negative = decreases)")
for _, row in coef_df.iterrows():
direction = "↑" if row['coefficient'] > 0 else "↓"
print(f" {direction} {row['feature']:15s}: {row['coefficient']:+.4f}")
print(f"\nIntercept: {logreg.intercept_[0]:.4f}")
# Visualize coefficients
plt.figure(figsize=(10, 6))
colors = ['green' if c > 0 else 'red' for c in coef_df['coefficient']]
plt.barh(coef_df['feature'], coef_df['coefficient'], color=colors)
plt.axvline(x=0, color='black', linewidth=0.5)
plt.xlabel('Coefficient Value')
plt.title('Logistic Regression Coefficients')
plt.tight_layout()
plt.show()
# Probability predictions
probabilities = logreg.predict_proba(X_test_scaled)
# Show distribution of probabilities
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(probabilities[:, 1], bins=30, edgecolor='black', alpha=0.7)
plt.axvline(x=0.5, color='red', linestyle='--', label='Decision Boundary')
plt.xlabel('P(UP)')
plt.ylabel('Frequency')
plt.title('Distribution of Predicted Probabilities')
plt.legend()
plt.subplot(1, 2, 2)
# Separate by actual class
prob_up = probabilities[y_test == 1, 1]
prob_down = probabilities[y_test == 0, 1]
plt.hist(prob_up, bins=20, alpha=0.6, label='Actual UP', color='green')
plt.hist(prob_down, bins=20, alpha=0.6, label='Actual DOWN', color='red')
plt.xlabel('P(UP)')
plt.ylabel('Frequency')
plt.title('Probabilities by Actual Class')
plt.legend()
plt.tight_layout()
plt.show()
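The probabilistic output is what makes logistic regression useful for position sizing. One simple scheme (illustrative, not a recommendation) ignores low-conviction predictions near 0.5 and scales position size linearly with conviction beyond that band:

```python
import numpy as np

def probability_to_position(p_up, band=0.05, max_size=1.0):
    """Map P(UP) to a signed position in [-max_size, max_size].

    Probabilities within `band` of 0.5 yield no position (low conviction);
    outside the band, size scales linearly with |P(UP) - 0.5|.
    """
    p_up = np.asarray(p_up, dtype=float)
    edge = p_up - 0.5                                 # signed conviction
    size = np.clip(2.0 * edge, -max_size, max_size)   # linear scaling
    return np.where(np.abs(edge) <= band, 0.0, size)

probs = np.array([0.35, 0.48, 0.50, 0.53, 0.62, 0.80])
positions = probability_to_position(probs)
for p, pos in zip(probs, positions):
    print(f"P(UP)={p:.2f} -> position {pos:+.2f}")
```

Negative sizes here imply shorting; a long-only variant would clip at zero instead.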
# Exercise 6.1: Regularization Comparison (Guided)
def compare_regularization(X_train: np.ndarray, y_train: pd.Series,
X_test: np.ndarray, y_test: pd.Series,
C_values: List[float] = [0.001, 0.01, 0.1, 1.0, 10.0]) -> pd.DataFrame:
"""
Compare logistic regression with different regularization strengths.
Returns:
DataFrame with C value, train accuracy, test accuracy, and coefficient stats
"""
results = []
for C in C_values:
# TODO: Create logistic regression with given C value
model = ______(
penalty='l2',
C=______,
max_iter=1000,
random_state=42
)
# TODO: Fit the model
model.______(X_train, y_train)
# TODO: Get train and test accuracy
train_acc = model.______(X_train, y_train)
test_acc = model.______(X_test, y_test)
# Coefficient statistics
coef_sum = np.sum(np.abs(model.coef_))
results.append({
'C': C,
'train_accuracy': train_acc,
'test_accuracy': test_acc,
'coef_sum': coef_sum
})
return pd.DataFrame(results)
# Test the function
# reg_results = compare_regularization(X_train_scaled, y_train, X_test_scaled, y_test)
Solution 6.1
def compare_regularization(X_train: np.ndarray, y_train: pd.Series,
X_test: np.ndarray, y_test: pd.Series,
C_values: List[float] = [0.001, 0.01, 0.1, 1.0, 10.0]) -> pd.DataFrame:
"""
Compare logistic regression with different regularization strengths.
"""
results = []
for C in C_values:
model = LogisticRegression(
penalty='l2',
C=C,
max_iter=1000,
random_state=42
)
model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
coef_sum = np.sum(np.abs(model.coef_))
results.append({
'C': C,
'train_accuracy': train_acc,
'test_accuracy': test_acc,
'coef_sum': coef_sum
})
return pd.DataFrame(results)
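In scikit-learn, C is the inverse of regularization strength, so smaller C shrinks coefficients harder. A quick sketch of that effect on synthetic data (make_classification here is just an illustration, not the course's market data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic classification data -- stands in for the scaled feature matrix
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)

coef_sums = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(penalty='l2', C=C, max_iter=1000, random_state=0)
    model.fit(X_syn, y_syn)
    # Total coefficient magnitude grows as regularization weakens (C increases)
    coef_sums[C] = np.sum(np.abs(model.coef_))

print(coef_sums)
```

This mirrors the coef_sum column in compare_regularization: expect it to increase with C.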
Section 2: Support Vector Machines (SVM)
SVMs find the optimal hyperplane that separates classes with maximum margin.
# SVM Concepts
svm_concepts = """
SUPPORT VECTOR MACHINES
=======================
Key Concepts:
-------------
1. Maximum Margin Classifier
- Finds hyperplane that maximizes distance to nearest points
- Support vectors: Points closest to the decision boundary
2. Kernel Trick
- Maps data to higher dimension for non-linear separation
- Common kernels: linear, poly, rbf (Gaussian), sigmoid
Visualization (2D):
-------------------
○ ○ ← Class 1
○ ○
───────────────── ← Decision boundary
● ●
● ● ● ← Class 0
Key Parameters:
---------------
- C: Regularization (higher = less regularization)
- kernel: 'linear', 'rbf', 'poly', 'sigmoid'
- gamma: Kernel coefficient for 'rbf' (higher = more complex)
Advantages:
-----------
+ Effective in high dimensions
+ Works well with clear margins
+ Versatile with different kernels
Disadvantages:
--------------
- Slow for large datasets
- Sensitive to feature scaling
- Probability estimates can be unreliable
- Memory intensive
"""
print(svm_concepts)
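The kernel trick can be made concrete for the RBF case: similarity between two points decays with squared distance as exp(-gamma * ||x - z||^2). A minimal sketch on toy vectors, checked against scikit-learn's pairwise implementation:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 0.0]])
gamma = 0.5

# Hand-rolled RBF kernel value
sq_dist = np.sum((x - z) ** 2)       # (1-2)^2 + (2-0)^2 = 5
manual = np.exp(-gamma * sq_dist)    # exp(-2.5)

# scikit-learn's version of the same computation
sk_val = rbf_kernel(x, z, gamma=gamma)[0, 0]

print(manual, sk_val)
```

Higher gamma makes the kernel more local, which is why large gamma values produce more complex (and more overfit-prone) decision boundaries.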
# Linear SVM
svm_linear = SVC(
kernel='linear',
C=1.0,
probability=True, # Enable probability estimates
random_state=42
)
svm_linear.fit(X_train_scaled, y_train)
train_acc = svm_linear.score(X_train_scaled, y_train)
test_acc = svm_linear.score(X_test_scaled, y_test)
print(f"Linear SVM Results:")
print(f" Train Accuracy: {train_acc:.2%}")
print(f" Test Accuracy: {test_acc:.2%}")
print(f" Support Vectors: {len(svm_linear.support_)} / {len(X_train)}")
# RBF (Gaussian) SVM
svm_rbf = SVC(
kernel='rbf',
C=1.0,
gamma='scale', # 1 / (n_features * X.var())
probability=True,
random_state=42
)
svm_rbf.fit(X_train_scaled, y_train)
train_acc = svm_rbf.score(X_train_scaled, y_train)
test_acc = svm_rbf.score(X_test_scaled, y_test)
print(f"RBF SVM Results:")
print(f" Train Accuracy: {train_acc:.2%}")
print(f" Test Accuracy: {test_acc:.2%}")
print(f" Support Vectors: {len(svm_rbf.support_)} / {len(X_train)}")
# Compare different kernels
kernels = ['linear', 'rbf', 'poly', 'sigmoid']
results = []
for kernel in kernels:
model = SVC(kernel=kernel, C=1.0, probability=True, random_state=42)
model.fit(X_train_scaled, y_train)
results.append({
'kernel': kernel,
'train_acc': model.score(X_train_scaled, y_train),
'test_acc': model.score(X_test_scaled, y_test),
'n_sv': len(model.support_)
})
kernel_df = pd.DataFrame(results)
print("\nKernel Comparison:")
print(kernel_df.to_string(index=False))
# Exercise 6.2: SVM Hyperparameter Tuner (Guided)
def tune_svm(X_train: np.ndarray, y_train: pd.Series,
C_values: List[float] = [0.1, 1.0, 10.0],
gamma_values: List[str] = ['scale', 'auto'],
cv_folds: int = 5) -> Dict:
"""
Tune SVM hyperparameters using time series cross-validation.
Returns:
Dictionary with best parameters and all results
"""
# TODO: Create time series cross-validator
tscv = ______(n_splits=cv_folds)
best_score = -1
best_params = {}
all_results = []
for C in C_values:
for gamma in gamma_values:
# TODO: Create SVC with RBF kernel and current parameters
model = ______(
kernel='rbf',
C=______,
gamma=______,
random_state=42
)
# TODO: Get cross-validation scores
scores = ______(model, X_train, y_train, cv=tscv, scoring='accuracy')
mean_score = scores.mean()
all_results.append({
'C': C,
'gamma': gamma,
'mean_score': mean_score,
'std_score': scores.std()
})
if mean_score > best_score:
best_score = mean_score
best_params = {'C': C, 'gamma': gamma}
return {
'best_params': best_params,
'best_score': best_score,
'all_results': pd.DataFrame(all_results)
}
# Test the function
# svm_results = tune_svm(X_train_scaled, y_train)
Solution 6.2
def tune_svm(X_train: np.ndarray, y_train: pd.Series,
C_values: List[float] = [0.1, 1.0, 10.0],
gamma_values: List[str] = ['scale', 'auto'],
cv_folds: int = 5) -> Dict:
"""
Tune SVM hyperparameters using time series cross-validation.
"""
tscv = TimeSeriesSplit(n_splits=cv_folds)
best_score = -1
best_params = {}
all_results = []
for C in C_values:
for gamma in gamma_values:
model = SVC(
kernel='rbf',
C=C,
gamma=gamma,
random_state=42
)
scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='accuracy')
mean_score = scores.mean()
all_results.append({
'C': C,
'gamma': gamma,
'mean_score': mean_score,
'std_score': scores.std()
})
if mean_score > best_score:
best_score = mean_score
best_params = {'C': C, 'gamma': gamma}
return {
'best_params': best_params,
'best_score': best_score,
'all_results': pd.DataFrame(all_results)
}
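TimeSeriesSplit, used in the tuner above, always trains on the past and validates on a later window, unlike shuffled k-fold. A small sketch of its expanding windows on a toy index:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(10).reshape(-1, 1)  # 10 chronologically ordered samples
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X_toy)):
    # The train window grows; the test window always lies strictly after it
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

No test window ever precedes its train window, which is what prevents lookahead leakage during tuning.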
Section 3: Neural Networks (MLP)
Multi-Layer Perceptrons (MLPs) are feedforward neural networks that can learn complex non-linear patterns.
# Neural Network Concepts
nn_concepts = """
NEURAL NETWORKS (MLP)
=====================
Architecture:
-------------
Input Layer Hidden Layers Output Layer
(x1) ──┬──→ (h1) ──┬──→ (h3) ──┬──→ (y)
│ ╳ │
(x2) ──┼──→ (h2) ──┼──→ (h4) ──┤
│ │ │
(x3) ──┴──→ ... ─┴──→ ... ──┘
Key Components:
---------------
1. Neurons: Apply weights, bias, and activation
2. Activation Functions: ReLU, tanh, sigmoid, softmax
3. Backpropagation: Update weights based on error
4. Optimizer: SGD, Adam, etc.
Activation Functions:
---------------------
- ReLU: max(0, x) - most common for hidden layers
- Sigmoid: 1/(1+e^-x) - for binary output
- Softmax: exp(x_i)/sum(exp(x)) - for multi-class
- Tanh: (e^x - e^-x)/(e^x + e^-x)
Key Parameters:
---------------
- hidden_layer_sizes: Tuple, e.g., (100, 50) for 2 layers
- activation: 'relu', 'tanh', 'logistic'
- solver: 'adam', 'sgd', 'lbfgs'
- alpha: L2 regularization
- learning_rate_init: Initial learning rate
- batch_size: Samples per gradient update
Advantages:
-----------
+ Learns complex non-linear patterns
+ Universal approximator
+ Can handle large feature spaces
Disadvantages:
--------------
- Prone to overfitting
- Requires careful tuning
- Black box (less interpretable)
- Needs more data than simpler models
"""
print(nn_concepts)
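The activation functions listed above are simple elementwise transforms; a minimal NumPy sketch on toy inputs:

```python
import numpy as np

x = np.array([-2.0, 0.0, 3.0])

relu = np.maximum(0, x)                 # [0., 0., 3.]
sigmoid = 1 / (1 + np.exp(-x))          # squashes each value into (0, 1)
tanh = np.tanh(x)                       # squashes each value into (-1, 1)
# Softmax turns the vector into probabilities summing to 1
# (in practice subtract x.max() first for numerical stability)
softmax = np.exp(x) / np.exp(x).sum()

print(relu, sigmoid.round(3), tanh.round(3), softmax.round(3))
```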
# Basic MLP Classifier
mlp = MLPClassifier(
hidden_layer_sizes=(64, 32), # Two hidden layers
activation='relu',
solver='adam',
alpha=0.001, # L2 regularization
batch_size=32,
learning_rate_init=0.001,
max_iter=500,
early_stopping=True,
validation_fraction=0.1,
random_state=42
)
mlp.fit(X_train_scaled, y_train)
train_acc = mlp.score(X_train_scaled, y_train)
test_acc = mlp.score(X_test_scaled, y_test)
print(f"MLP Classifier Results:")
print(f" Train Accuracy: {train_acc:.2%}")
print(f" Test Accuracy: {test_acc:.2%}")
print(f" Iterations: {mlp.n_iter_}")
print(f" Final Loss: {mlp.loss_:.4f}")
# Training loss curve
plt.figure(figsize=(10, 5))
plt.plot(mlp.loss_curve_, linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('MLP Training Loss Curve')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Compare different architectures
architectures = [
(32,), # Shallow
(64, 32), # Two layers
(128, 64, 32), # Three layers
(256, 128, 64), # Wider
]
results = []
for hidden_sizes in architectures:
model = MLPClassifier(
hidden_layer_sizes=hidden_sizes,
activation='relu',
solver='adam',
alpha=0.001,
max_iter=500,
early_stopping=True,
random_state=42
)
model.fit(X_train_scaled, y_train)
results.append({
'architecture': str(hidden_sizes),
'train_acc': model.score(X_train_scaled, y_train),
'test_acc': model.score(X_test_scaled, y_test),
'iterations': model.n_iter_
})
arch_df = pd.DataFrame(results)
print("\nArchitecture Comparison:")
print(arch_df.to_string(index=False))
# Exercise 6.3: Neural Network Builder (Guided)
def build_nn_classifier(X_train: np.ndarray, y_train: pd.Series,
hidden_sizes: Tuple[int, ...] = (64, 32),
dropout_rate: float = 0.2,
learning_rate: float = 0.001) -> MLPClassifier:
"""
Build and train a neural network classifier.
Note: sklearn's MLP doesn't support dropout directly,
so we use alpha for regularization instead.
"""
# Approximate dropout effect with alpha
alpha = dropout_rate * 0.01
# TODO: Create MLP classifier with the given parameters
model = ______(
hidden_layer_sizes=______,
activation='relu',
solver='adam',
alpha=______,
learning_rate_init=______,
max_iter=500,
early_stopping=True,
validation_fraction=0.1,
random_state=42
)
# TODO: Fit the model
model.______(X_train, y_train)
return model
# Test the function
# nn_model = build_nn_classifier(X_train_scaled, y_train)
Solution 6.3
def build_nn_classifier(X_train: np.ndarray, y_train: pd.Series,
hidden_sizes: Tuple[int, ...] = (64, 32),
dropout_rate: float = 0.2,
learning_rate: float = 0.001) -> MLPClassifier:
"""
Build and train a neural network classifier.
"""
alpha = dropout_rate * 0.01
model = MLPClassifier(
hidden_layer_sizes=hidden_sizes,
activation='relu',
solver='adam',
alpha=alpha,
learning_rate_init=learning_rate,
max_iter=500,
early_stopping=True,
validation_fraction=0.1,
random_state=42
)
model.fit(X_train, y_train)
return model
Section 4: Model Comparison
Systematically compare all classification models on trading data.
# Comprehensive model comparison
def get_all_models() -> Dict:
"""Get dictionary of all classification models."""
return {
'Logistic (L2)': LogisticRegression(penalty='l2', C=1.0, max_iter=1000, random_state=42),
'Logistic (L1)': LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=1000, random_state=42),
'SVM (Linear)': SVC(kernel='linear', C=1.0, probability=True, random_state=42),
'SVM (RBF)': SVC(kernel='rbf', C=1.0, probability=True, random_state=42),
'MLP (Small)': MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, early_stopping=True, random_state=42),
'MLP (Medium)': MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, early_stopping=True, random_state=42),
}
models = get_all_models()
comparison_results = []
print("Training and evaluating models...\n")
for name, model in models.items():
model.fit(X_train_scaled, y_train)
train_acc = model.score(X_train_scaled, y_train)
test_acc = model.score(X_test_scaled, y_test)
comparison_results.append({
'Model': name,
'Train Acc': train_acc,
'Test Acc': test_acc,
'Overfit Gap': train_acc - test_acc
})
comparison_df = pd.DataFrame(comparison_results).sort_values('Test Acc', ascending=False)
print(comparison_df.to_string(index=False))
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Accuracy comparison
x = np.arange(len(comparison_df))
width = 0.35
axes[0].bar(x - width/2, comparison_df['Train Acc'], width, label='Train', alpha=0.8)
axes[0].bar(x + width/2, comparison_df['Test Acc'], width, label='Test', alpha=0.8)
axes[0].set_ylabel('Accuracy')
axes[0].set_xticks(x)
axes[0].set_xticklabels(comparison_df['Model'], rotation=45, ha='right')
axes[0].axhline(y=0.5, color='red', linestyle='--', alpha=0.5, label='Random')
axes[0].legend()  # called after axhline so the 'Random' line appears in the legend
axes[0].set_title('Train vs Test Accuracy')
# Overfitting comparison
colors = ['red' if g > 0.05 else 'green' for g in comparison_df['Overfit Gap']]
axes[1].bar(comparison_df['Model'], comparison_df['Overfit Gap'], color=colors, alpha=0.7)
axes[1].axhline(y=0, color='black', linewidth=0.5)
axes[1].set_ylabel('Overfit Gap (Train - Test)')
axes[1].set_xticks(range(len(comparison_df)))
axes[1].set_xticklabels(comparison_df['Model'], rotation=45, ha='right')
axes[1].set_title('Overfitting Analysis')
plt.tight_layout()
plt.show()
# Exercise 6.4: Complete Classifier Comparison System (Open-ended)
#
# Build a ClassifierCompare class that:
# - Takes a dictionary of sklearn classifiers
# - Uses time series cross-validation to evaluate each
# - Tracks accuracy, precision, recall, and F1 score
# - Generates a comparison report DataFrame
# - Provides a plot_comparison() method for visualization
# - Recommends the best model with reasoning
#
# Your implementation:
Solution 6.4
from sklearn.metrics import precision_score, recall_score, f1_score
class ClassifierCompare:
"""Compare multiple classifiers systematically."""
def __init__(self, classifiers: Dict):
self.classifiers = classifiers
self.results = {}
self.fitted_models = {}
def evaluate(self, X_train: np.ndarray, y_train: pd.Series,
X_test: np.ndarray, y_test: pd.Series,
cv_folds: int = 5):
"""Evaluate all classifiers."""
tscv = TimeSeriesSplit(n_splits=cv_folds)
for name, clf in self.classifiers.items():
print(f"Evaluating {name}...")
# Cross-validation
cv_scores = cross_val_score(clf, X_train, y_train, cv=tscv, scoring='accuracy')
# Fit on full training set
clf.fit(X_train, y_train)
self.fitted_models[name] = clf
# Predictions
y_pred = clf.predict(X_test)
# Metrics
self.results[name] = {
'cv_accuracy_mean': cv_scores.mean(),
'cv_accuracy_std': cv_scores.std(),
'train_accuracy': clf.score(X_train, y_train),
'test_accuracy': clf.score(X_test, y_test),
'precision': precision_score(y_test, y_pred, zero_division=0),
'recall': recall_score(y_test, y_pred, zero_division=0),
'f1': f1_score(y_test, y_pred, zero_division=0)
}
return self
def get_report(self) -> pd.DataFrame:
"""Generate comparison report."""
rows = []
for name, metrics in self.results.items():
rows.append({
'Model': name,
'CV Acc': f"{metrics['cv_accuracy_mean']:.2%} +/- {metrics['cv_accuracy_std']:.2%}",
'Train Acc': f"{metrics['train_accuracy']:.2%}",
'Test Acc': f"{metrics['test_accuracy']:.2%}",
'Precision': f"{metrics['precision']:.2%}",
'Recall': f"{metrics['recall']:.2%}",
'F1': f"{metrics['f1']:.2%}"
})
return pd.DataFrame(rows)
def plot_comparison(self):
"""Visualize comparison."""
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
models = list(self.results.keys())
test_accs = [self.results[m]['test_accuracy'] for m in models]
f1_scores = [self.results[m]['f1'] for m in models]
x = np.arange(len(models))
axes[0].bar(x, test_accs, color='steelblue', alpha=0.8)
axes[0].set_ylabel('Test Accuracy')
axes[0].set_xticks(x)
axes[0].set_xticklabels(models, rotation=45, ha='right')
axes[0].axhline(y=0.5, color='red', linestyle='--')
axes[0].set_title('Test Accuracy Comparison')
axes[1].bar(x, f1_scores, color='forestgreen', alpha=0.8)
axes[1].set_ylabel('F1 Score')
axes[1].set_xticks(x)
axes[1].set_xticklabels(models, rotation=45, ha='right')
axes[1].set_title('F1 Score Comparison')
plt.tight_layout()
plt.show()
def recommend(self) -> str:
"""Recommend best model."""
best_name = max(self.results.keys(),
key=lambda x: self.results[x]['test_accuracy'])
best = self.results[best_name]
overfit = best['train_accuracy'] - best['test_accuracy']
reasoning = []
if best['test_accuracy'] > 0.52:
reasoning.append("Shows predictive signal above random")
if overfit < 0.05:
reasoning.append("Low overfitting gap")
if best['f1'] > 0.5:
reasoning.append("Balanced precision/recall")
return f"""Recommended: {best_name}
Test Accuracy: {best['test_accuracy']:.2%}
F1 Score: {best['f1']:.2%}
Reasoning: {'; '.join(reasoning) if reasoning else 'Best among options'}"""
# Exercise 6.5: Probability Calibration Analyzer (Open-ended)
#
# Build a ProbabilityCalibrator class that:
# - Takes a fitted classifier with predict_proba
# - Analyzes calibration using bins (reliability diagram)
# - Calculates Brier score
# - Implements isotonic or Platt scaling calibration
# - Compares calibrated vs uncalibrated probabilities
#
# Your implementation:
Solution 6.5
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import accuracy_score, brier_score_loss
class ProbabilityCalibrator:
"""Analyze and improve probability calibration."""
def __init__(self, classifier, n_bins: int = 10):
self.classifier = classifier
self.n_bins = n_bins
self.calibrated_clf = None
def analyze_calibration(self, X: np.ndarray, y: pd.Series) -> Dict:
"""Analyze probability calibration."""
probas = self.classifier.predict_proba(X)[:, 1]
# Calibration curve
prob_true, prob_pred = calibration_curve(y, probas, n_bins=self.n_bins)
# Brier score
brier = brier_score_loss(y, probas)
return {
'prob_true': prob_true,
'prob_pred': prob_pred,
'brier_score': brier,
'probabilities': probas
}
def calibrate(self, X: np.ndarray, y: pd.Series,
method: str = 'isotonic') -> 'ProbabilityCalibrator':
"""Apply calibration to the classifier."""
self.calibrated_clf = CalibratedClassifierCV(
self.classifier,
method=method, # 'isotonic' or 'sigmoid'
cv=3
)
self.calibrated_clf.fit(X, y)
return self
def compare(self, X: np.ndarray, y: pd.Series) -> pd.DataFrame:
"""Compare calibrated vs uncalibrated."""
uncal_probas = self.classifier.predict_proba(X)[:, 1]
cal_probas = self.calibrated_clf.predict_proba(X)[:, 1]
uncal_brier = brier_score_loss(y, uncal_probas)
cal_brier = brier_score_loss(y, cal_probas)
uncal_acc = accuracy_score(y, (uncal_probas > 0.5).astype(int))
cal_acc = accuracy_score(y, (cal_probas > 0.5).astype(int))
return pd.DataFrame({
'Metric': ['Brier Score', 'Accuracy'],
'Uncalibrated': [uncal_brier, uncal_acc],
'Calibrated': [cal_brier, cal_acc],
'Improvement': [uncal_brier - cal_brier, cal_acc - uncal_acc]
})
def plot_calibration(self, X: np.ndarray, y: pd.Series):
"""Plot calibration curves."""
uncal = self.analyze_calibration(X, y)
plt.figure(figsize=(10, 8))
# Perfect calibration line
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
# Uncalibrated
plt.plot(uncal['prob_pred'], uncal['prob_true'],
's-', label=f"Uncalibrated (Brier: {uncal['brier_score']:.4f})")
# Calibrated
if self.calibrated_clf:
cal_probas = self.calibrated_clf.predict_proba(X)[:, 1]
cal_true, cal_pred = calibration_curve(y, cal_probas, n_bins=self.n_bins)
cal_brier = brier_score_loss(y, cal_probas)
plt.plot(cal_pred, cal_true, 'o-',
label=f"Calibrated (Brier: {cal_brier:.4f})")
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Probability Calibration Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
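The Brier score used throughout this class is just the mean squared error between the predicted probability and the 0/1 outcome; a quick check against scikit-learn on toy values:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.4, 0.1])

# Brier score = mean((p - y)^2); lower is better, 0 is perfect
manual = np.mean((p_pred - y_true) ** 2)
sk_val = brier_score_loss(y_true, p_pred)

print(manual, sk_val)  # both 0.116
```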
# Exercise 6.6: Trading Signal Pipeline Builder (Open-ended)
#
# Build a TradingSignalPipeline class that:
# - Combines preprocessing, feature scaling, and classification
# - Supports multiple classifier backends (logreg, svm, mlp)
# - Generates signals with confidence levels
# - Implements fit(), predict(), and predict_proba()
# - Has a get_signal_strength() method returning -1 to +1
# - Provides interpretability info (coefficients or feature importance)
#
# Your implementation:
Solution 6.6
class TradingSignalPipeline:
"""Complete pipeline for trading signal generation."""
CLASSIFIERS = {
'logreg': LogisticRegression(max_iter=1000, random_state=42),
'svm': SVC(kernel='rbf', probability=True, random_state=42),
'mlp': MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
early_stopping=True, random_state=42)
}
def __init__(self, classifier_type: str = 'logreg'):
if classifier_type not in self.CLASSIFIERS:
raise ValueError(f"Unknown classifier: {classifier_type}")
self.classifier_type = classifier_type
self.scaler = StandardScaler()
self.classifier = self.CLASSIFIERS[classifier_type]
self.feature_names = None
def fit(self, X: pd.DataFrame, y: pd.Series):
"""Fit the pipeline."""
self.feature_names = X.columns.tolist()
X_scaled = self.scaler.fit_transform(X)
self.classifier.fit(X_scaled, y)
return self
def predict(self, X: pd.DataFrame) -> np.ndarray:
"""Predict class labels."""
X_scaled = self.scaler.transform(X)
return self.classifier.predict(X_scaled)
def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
"""Predict class probabilities."""
X_scaled = self.scaler.transform(X)
return self.classifier.predict_proba(X_scaled)
def get_signal_strength(self, X: pd.DataFrame) -> np.ndarray:
"""Get signal strength from -1 (strong sell) to +1 (strong buy)."""
probas = self.predict_proba(X)
# Map [0, 1] to [-1, 1]
return (probas[:, 1] - 0.5) * 2
def get_signals(self, X: pd.DataFrame) -> pd.DataFrame:
"""Get detailed signal information."""
predictions = self.predict(X)
probabilities = self.predict_proba(X)
strength = self.get_signal_strength(X)
return pd.DataFrame({
'signal': predictions,
'signal_name': pd.Series(predictions, index=X.index).map({0: 'SELL', 1: 'BUY'}),
'prob_down': probabilities[:, 0],
'prob_up': probabilities[:, 1],
'strength': strength,
'confidence': np.abs(strength)
}, index=X.index)
def get_interpretability(self) -> Optional[pd.DataFrame]:
"""Get model interpretability info."""
if self.classifier_type == 'logreg':
return pd.DataFrame({
'feature': self.feature_names,
'coefficient': self.classifier.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)
elif self.classifier_type == 'mlp':
# Approximate importance from first layer weights
weights = np.abs(self.classifier.coefs_[0]).mean(axis=1)
return pd.DataFrame({
'feature': self.feature_names,
'importance': weights
}).sort_values('importance', ascending=False)
else:
return None
def score(self, X: pd.DataFrame, y: pd.Series) -> float:
"""Calculate accuracy."""
X_scaled = self.scaler.transform(X)
return self.classifier.score(X_scaled, y)
Module Project: Multi-Model Trading Signal Ensemble
Build a complete trading system that combines multiple classification models.
class MultiModelTradingSystem:
"""
Trading system combining multiple classification models.
Uses logistic regression, SVM, and neural network for robust predictions.
"""
def __init__(self):
self.scaler = StandardScaler()
self.models = {
'logreg': LogisticRegression(penalty='l2', C=1.0, max_iter=1000, random_state=42),
'svm': SVC(kernel='rbf', C=1.0, probability=True, random_state=42),
'mlp': MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
early_stopping=True, random_state=42)
}
self.weights = {'logreg': 0.3, 'svm': 0.3, 'mlp': 0.4}
self.feature_names = None
def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create features from OHLCV data."""
features = pd.DataFrame(index=df.index)
# Price features
features['returns'] = df['Close'].pct_change()
features['volatility'] = features['returns'].rolling(20).std()
# Momentum
for period in [5, 10, 20]:
features[f'momentum_{period}'] = df['Close'].pct_change(period)
# Moving average distances
for period in [5, 20, 50]:
ma = df['Close'].rolling(period).mean()
features[f'dist_ma{period}'] = (df['Close'] - ma) / ma
# RSI
delta = df['Close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
features['rsi'] = 100 - (100 / (1 + gain / loss))
features['rsi_normalized'] = (features['rsi'] - 50) / 50
# Volume
features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
return features.dropna()
def fit(self, df: pd.DataFrame, test_size: float = 0.2):
"""Fit all models on the data."""
# Create features
features = self.create_features(df)
self.feature_names = features.columns.tolist()
# Create target
aligned_df = df.loc[features.index]
target = (aligned_df['Close'].pct_change().shift(-1) > 0).astype(int)
# Remove last row
features = features[:-1]
target = target[:-1]
# Split
split_idx = int(len(features) * (1 - test_size))
X_train = features[:split_idx]
X_test = features[split_idx:]
y_train = target[:split_idx]
y_test = target[split_idx:]
# Scale
X_train_scaled = self.scaler.fit_transform(X_train)
X_test_scaled = self.scaler.transform(X_test)
# Fit all models
print("Training models...\n")
for name, model in self.models.items():
model.fit(X_train_scaled, y_train)
train_acc = model.score(X_train_scaled, y_train)
test_acc = model.score(X_test_scaled, y_test)
print(f"{name:10s}: Train {train_acc:.2%}, Test {test_acc:.2%}")
# Ensemble performance
ensemble_pred = self._ensemble_predict(X_test_scaled)
ensemble_acc = accuracy_score(y_test, ensemble_pred)
print(f"{'Ensemble':10s}: Test {ensemble_acc:.2%}")
return self
def _ensemble_predict_proba(self, X: np.ndarray) -> np.ndarray:
"""Weighted average of probabilities."""
weighted_proba = np.zeros((len(X), 2))
for name, model in self.models.items():
weighted_proba += self.weights[name] * model.predict_proba(X)
return weighted_proba
def _ensemble_predict(self, X: np.ndarray) -> np.ndarray:
"""Ensemble prediction."""
proba = self._ensemble_predict_proba(X)
return (proba[:, 1] > 0.5).astype(int)
def predict(self, df: pd.DataFrame) -> pd.DataFrame:
"""Generate trading signals."""
features = self.create_features(df)
X_scaled = self.scaler.transform(features)
# Get individual predictions
signals = pd.DataFrame(index=features.index)
for name, model in self.models.items():
signals[f'{name}_prob'] = model.predict_proba(X_scaled)[:, 1]
signals[f'{name}_signal'] = model.predict(X_scaled)
# Ensemble
ensemble_proba = self._ensemble_predict_proba(X_scaled)
signals['ensemble_prob'] = ensemble_proba[:, 1]
signals['ensemble_signal'] = self._ensemble_predict(X_scaled)
# Signal strength and agreement
signals['strength'] = (signals['ensemble_prob'] - 0.5) * 2
signals['model_agreement'] = (
signals['logreg_signal'] +
signals['svm_signal'] +
signals['mlp_signal']
) / 3
return signals
def get_feature_importance(self) -> pd.DataFrame:
"""Get feature importance from logistic regression."""
return pd.DataFrame({
'feature': self.feature_names,
'coefficient': self.models['logreg'].coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)
def backtest(self, df: pd.DataFrame) -> pd.DataFrame:
"""Simple backtest of the system."""
signals = self.predict(df)
# Get returns
returns = df['Close'].pct_change().shift(-1)
aligned_returns = returns.loc[signals.index]
# Strategy returns: next_return at t is already the t -> t+1 return,
# so the signal generated at t pairs with it directly (no extra shift)
signals['next_return'] = aligned_returns
signals['strategy_return'] = signals['ensemble_signal'] * signals['next_return']
# Strength-weighted returns
signals['weighted_return'] = signals['strength'] * signals['next_return']
# Cumulative
signals['cum_strategy'] = (1 + signals['strategy_return'].fillna(0)).cumprod()
signals['cum_weighted'] = (1 + signals['weighted_return'].fillna(0)).cumprod()
signals['cum_bh'] = (1 + signals['next_return'].fillna(0)).cumprod()
return signals
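The ensemble's weighted soft vote can be traced by hand on hypothetical model outputs (the probabilities below are made up for illustration, not from the fitted system):

```python
import numpy as np

# Hypothetical P(UP) from each model for two samples
probs = {
    'logreg': np.array([0.60, 0.45]),
    'svm':    np.array([0.55, 0.40]),
    'mlp':    np.array([0.70, 0.52]),
}
weights = {'logreg': 0.3, 'svm': 0.3, 'mlp': 0.4}

# Weighted average of P(UP), then threshold at 0.5
ensemble_prob = sum(weights[k] * probs[k] for k in probs)
signal = (ensemble_prob > 0.5).astype(int)

print(ensemble_prob, signal)  # [0.625 0.463] -> signals [1 0]
```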
# Test the multi-model system
# Get data
ticker = yf.Ticker("SPY")
data = ticker.history(period="2y")
# Create and train system
system = MultiModelTradingSystem()
system.fit(data)
# Backtest and visualize (note: this spans the training period too,
# so results are partly in-sample and optimistic)
backtest = system.backtest(data)
# Plot results
fig, axes = plt.subplots(2, 1, figsize=(14, 10))
# Cumulative returns
axes[0].plot(backtest['cum_strategy'].dropna(), label='Binary Strategy', linewidth=2)
axes[0].plot(backtest['cum_weighted'].dropna(), label='Weighted Strategy', linewidth=2)
axes[0].plot(backtest['cum_bh'].dropna(), label='Buy & Hold', linewidth=2, alpha=0.7)
axes[0].set_ylabel('Cumulative Return')
axes[0].set_title('Multi-Model Trading System Performance')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Model agreement
axes[1].plot(backtest['model_agreement'].dropna(), alpha=0.7, linewidth=1)
axes[1].axhline(y=0.5, color='gray', linestyle='--')
axes[1].fill_between(backtest.index, 0, 1, where=backtest['model_agreement'] > 0.66,
alpha=0.3, color='green', label='Strong Buy')
axes[1].fill_between(backtest.index, 0, 1, where=backtest['model_agreement'] < 0.33,
alpha=0.3, color='red', label='Strong Sell')
axes[1].set_ylabel('Model Agreement (0-1)')
axes[1].set_title('Model Agreement Over Time')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Performance summary
strategy_return = backtest['cum_strategy'].iloc[-1] - 1
weighted_return = backtest['cum_weighted'].iloc[-1] - 1
bh_return = backtest['cum_bh'].iloc[-1] - 1
print("\nPerformance Summary:")
print(f" Binary Strategy: {strategy_return:.2%}")
print(f" Weighted Strategy: {weighted_return:.2%}")
print(f" Buy & Hold: {bh_return:.2%}")
print(f"\n Outperformance (Binary): {strategy_return - bh_return:.2%}")
print(f" Outperformance (Weighted): {weighted_return - bh_return:.2%}")
# Feature importance
print("\nTop Features (from Logistic Regression):")
importance = system.get_feature_importance()
print(importance.head(5).to_string(index=False))
Key Takeaways
- Logistic Regression provides interpretable coefficients and probability outputs; regularization (L1/L2) prevents overfitting
- Support Vector Machines find maximum-margin boundaries; kernels (RBF, polynomial) capture non-linear patterns
- Neural Networks (MLP) learn complex patterns but require careful tuning and more data to avoid overfitting
- Feature scaling is crucial for SVM and neural networks; always scale before training
- Probability calibration matters for trading; well-calibrated probabilities improve position sizing
- Model ensembles often outperform individual models by combining diverse perspectives
- No single best model exists; the right choice depends on data characteristics and interpretability needs
Next: Module 7 - Model Evaluation (Classification metrics, financial metrics, ROC curves)
Module 7: Model Evaluation
Part 2: Classification Models
| Duration | Exercises | Prerequisites |
|---|---|---|
| ~2.5 hours | 6 | Modules 1-6 |
Learning Objectives
By the end of this module, you will be able to:
- Calculate and interpret classification metrics (accuracy, precision, recall, F1)
- Use confusion matrices for detailed error analysis
- Apply ROC curves and AUC for threshold-independent evaluation
- Evaluate models with financial metrics (returns, Sharpe ratio)
- Implement proper walk-forward validation for realistic assessment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')
# Scikit-learn metrics
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, classification_report,
roc_curve, auc, roc_auc_score,
precision_recall_curve, average_precision_score
)
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import yfinance as yf
print("Module 7: Model Evaluation")
print("=" * 40)
# Prepare data and train a model for evaluation
def prepare_data(symbol: str = "SPY", period: str = "2y") -> Tuple:
"""Prepare features, target, and train/test splits."""
ticker = yf.Ticker(symbol)
df = ticker.history(period=period)
# Features
df['returns'] = df['Close'].pct_change()
df['volatility'] = df['returns'].rolling(20).std()
df['momentum_5'] = df['Close'].pct_change(5)
df['momentum_20'] = df['Close'].pct_change(20)
for period_len in [5, 20, 50]:
ma = df['Close'].rolling(period_len).mean()
df[f'dist_ma{period_len}'] = (df['Close'] - ma) / ma
delta = df['Close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
df['rsi'] = 100 - (100 / (1 + gain / loss))
df['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
df['target'] = (df['returns'].shift(-1) > 0).astype(int)
df = df.dropna()
features = ['volatility', 'momentum_5', 'momentum_20', 'dist_ma5',
'dist_ma20', 'dist_ma50', 'rsi', 'volume_ratio']
X = df[features]
y = df['target']
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
return X_train, X_test, y_train, y_test, df
# Load data
X_train, X_test, y_train, y_test, df = prepare_data()
# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train_scaled, y_train)
# Get predictions
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)
print(f"Test samples: {len(y_test)}")
print(f"Predictions shape: {y_pred.shape}")
Section 1: Classification Metrics
Understanding the fundamental metrics for evaluating classification models.
# Classification Metrics Overview
metrics_overview = """
CLASSIFICATION METRICS
======================
Confusion Matrix:
-----------------
Predicted
Neg Pos
Neg [ TN | FP ]
Actual Pos [ FN | TP ]
TN = True Negative: Correctly predicted DOWN
TP = True Positive: Correctly predicted UP
FP = False Positive: Predicted UP, was DOWN (Type I error)
FN = False Negative: Predicted DOWN, was UP (Type II error)
Key Metrics:
------------
Accuracy = (TP + TN) / (TP + TN + FP + FN)
→ Overall correctness
→ Misleading for imbalanced data
Precision = TP / (TP + FP)
→ "Of all predicted UP, how many were actually UP?"
→ High precision = few false alarms
→ Important when FP is costly (e.g., buying on wrong signal)
Recall = TP / (TP + FN)
→ "Of all actual UP days, how many did we catch?"
→ High recall = don't miss opportunities
→ Important when FN is costly (e.g., missing big moves)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
→ Harmonic mean of precision and recall
→ Balanced measure when both matter
Trading Context:
----------------
High Precision Strategy: "Only trade when very confident"
High Recall Strategy: "Never miss a move"
Balanced (F1): "Reasonable trade-off"
"""
print(metrics_overview)
# Calculate basic metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Classification Metrics:")
print(f" Accuracy: {accuracy:.4f} ({accuracy:.2%})")
print(f" Precision: {precision:.4f} ({precision:.2%})")
print(f" Recall: {recall:.4f} ({recall:.2%})")
print(f" F1 Score: {f1:.4f} ({f1:.2%})")
# Random baseline
print("\n Random Baseline: 50.00%")
print(f" Improvement over random: {(accuracy - 0.5) * 100:.2f}pp")
# Full classification report
print("\nClassification Report:")
print("=" * 60)
print(classification_report(y_test, y_pred, target_names=['DOWN', 'UP']))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Predicted DOWN', 'Predicted UP'],
yticklabels=['Actual DOWN', 'Actual UP'])
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()
# Extract values
tn, fp, fn, tp = cm.ravel()
print(f"\nConfusion Matrix Breakdown:")
print(f" True Negatives (DOWN → DOWN): {tn}")
print(f" False Positives (DOWN → UP): {fp} ← Wrong buy signals")
print(f" False Negatives (UP → DOWN): {fn} ← Missed opportunities")
print(f" True Positives (UP → UP): {tp}")
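As a sanity check, the four formulas above can be recomputed directly from raw confusion-matrix counts. The sketch below uses made-up counts for illustration, not the model's actual results:

```python
# Recompute the key metrics by hand from confusion-matrix counts.
# These counts are illustrative only, not taken from the model above.
tn, fp, fn, tp = 60, 25, 30, 85

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # overall correctness
precision = tp / (tp + fp)                    # of predicted UP, how many were UP
recall    = tp / (tp + fn)                    # of actual UP, how many we caught
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Note that F1 can equivalently be written as 2·TP / (2·TP + FP + FN), which makes explicit that it ignores true negatives entirely.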
# Exercise 7.1: Metrics Calculator (Guided)
def calculate_all_metrics(y_true: pd.Series, y_pred: np.ndarray,
y_proba: np.ndarray = None) -> Dict:
"""
Calculate comprehensive classification metrics.
Returns:
Dictionary with all metrics
"""
# Basic metrics
metrics = {}
# TODO: Calculate accuracy, precision, recall, f1
metrics['accuracy'] = ______(y_true, y_pred)
metrics['precision'] = ______(y_true, y_pred)
metrics['recall'] = ______(y_true, y_pred)
metrics['f1'] = ______(y_true, y_pred)
# TODO: Get confusion matrix values
cm = ______(y_true, y_pred)
tn, fp, fn, tp = cm.______()
metrics['true_negatives'] = tn
metrics['false_positives'] = fp
metrics['false_negatives'] = fn
metrics['true_positives'] = tp
# Specificity (true negative rate)
metrics['specificity'] = tn / (tn + fp) if (tn + fp) > 0 else 0
# ROC AUC if probabilities provided
if y_proba is not None:
metrics['roc_auc'] = roc_auc_score(y_true, y_proba[:, 1])
return metrics
# Test the function
# metrics = calculate_all_metrics(y_test, y_pred, y_proba)
Solution 7.1
def calculate_all_metrics(y_true: pd.Series, y_pred: np.ndarray,
y_proba: np.ndarray = None) -> Dict:
"""
Calculate comprehensive classification metrics.
"""
metrics = {}
metrics['accuracy'] = accuracy_score(y_true, y_pred)
metrics['precision'] = precision_score(y_true, y_pred)
metrics['recall'] = recall_score(y_true, y_pred)
metrics['f1'] = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
metrics['true_negatives'] = tn
metrics['false_positives'] = fp
metrics['false_negatives'] = fn
metrics['true_positives'] = tp
metrics['specificity'] = tn / (tn + fp) if (tn + fp) > 0 else 0
if y_proba is not None:
metrics['roc_auc'] = roc_auc_score(y_true, y_proba[:, 1])
return metrics
Section 2: ROC Curves and AUC
ROC curves provide threshold-independent model evaluation.
# ROC Curve Concepts
roc_concepts = """
ROC CURVES AND AUC
==================
What is ROC?
------------
ROC = Receiver Operating Characteristic
- Plots True Positive Rate vs False Positive Rate at different thresholds
- Shows trade-off between catching positives and creating false alarms
Axes:
-----
Y-axis: True Positive Rate (TPR) = Recall = TP / (TP + FN)
X-axis: False Positive Rate (FPR) = FP / (FP + TN)
Interpretation:
---------------
- Diagonal line = random classifier (AUC = 0.5)
- Upper left corner = perfect classifier (AUC = 1.0)
- Curve above diagonal = better than random
AUC (Area Under Curve):
-----------------------
- 1.0: Perfect classifier
- 0.9-1.0: Excellent
- 0.8-0.9: Good
- 0.7-0.8: Fair
- 0.5-0.7: Poor
- 0.5: Random
Trading Context:
----------------
AUC 0.5: Model has no predictive power
AUC 0.55: Slight edge (may be profitable with good execution)
AUC 0.60+: Strong signal (rare in liquid markets)
"""
print(roc_concepts)
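AUC also has a useful probabilistic reading: it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below verifies this equivalence on a tiny hand-made set of labels and scores (illustrative values, not model output):

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Tiny illustrative example: true labels and model scores.
y_true  = [0, 0, 0, 1, 1]
y_score = [0.1, 0.75, 0.35, 0.8, 0.7]

# Pairwise interpretation: fraction of (positive, negative) pairs
# where the positive example is scored higher (ties count as 0.5).
pos = [s for y, s in zip(y_true, y_score) if y == 1]
neg = [s for y, s in zip(y_true, y_score) if y == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

# Matches sklearn's curve-based AUC.
assert abs(pairwise_auc - roc_auc_score(y_true, y_score)) < 1e-9
print(f"AUC = {pairwise_auc:.4f}")  # 5 of 6 pairs ranked correctly
```

This ranking view explains why AUC is threshold-independent: it only cares about the ordering of scores, not their absolute values.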
# Calculate and plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10, 8))
# ROC curve
plt.plot(fpr, tpr, color='darkorange', lw=2,
label=f'ROC curve (AUC = {roc_auc:.4f})')
# Random baseline
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--',
label='Random (AUC = 0.500)')
# Formatting
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"AUC: {roc_auc:.4f}")
print(f"Interpretation: {'Better than random' if roc_auc > 0.5 else 'No predictive power'}")
# Optimal threshold selection
# Youden's J statistic: TPR - FPR
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"Optimal Threshold (Youden's J): {optimal_threshold:.4f}")
print(f" At this threshold:")
print(f" TPR (Recall): {tpr[optimal_idx]:.4f}")
print(f" FPR: {fpr[optimal_idx]:.4f}")
# Apply optimal threshold
y_pred_optimal = (y_proba[:, 1] >= optimal_threshold).astype(int)
print(f"\nWith optimal threshold:")
print(f" Accuracy: {accuracy_score(y_test, y_pred_optimal):.4f}")
print(f" Precision: {precision_score(y_test, y_pred_optimal):.4f}")
print(f" Recall: {recall_score(y_test, y_pred_optimal):.4f}")
# Precision-Recall Curve (better for imbalanced data)
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_test, y_proba[:, 1])
avg_precision = average_precision_score(y_test, y_proba[:, 1])
plt.figure(figsize=(10, 8))
plt.plot(recall_curve, precision_curve, color='blue', lw=2,
label=f'PR curve (AP = {avg_precision:.4f})')
# Baseline (proportion of positive class)
baseline = y_test.mean()
plt.axhline(y=baseline, color='gray', linestyle='--',
label=f'Random baseline ({baseline:.4f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.grid(True, alpha=0.3)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.tight_layout()
plt.show()
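The precision-recall baseline above makes the imbalance problem concrete: accuracy rewards a degenerate majority-class predictor, while recall and average precision expose it. A self-contained sketch with synthetic labels (roughly 90% DOWN, not market data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, average_precision_score

# Synthetic imbalanced labels: ~90% DOWN (0), ~10% UP (1).
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.10).astype(int)

# Degenerate "model": always predict the majority class, score 0 for UP.
y_pred = np.zeros_like(y)
y_score = np.zeros(len(y), dtype=float)

print(f"accuracy:      {accuracy_score(y, y_pred):.2%}")  # high, ~90%
print(f"recall:        {recall_score(y, y_pred):.2%}")    # 0% - catches nothing
print(f"avg precision: {average_precision_score(y, y_score):.4f}")  # ~base rate
```

This is why the PR curve's baseline is the positive-class prevalence rather than 0.5: a no-skill model can do no better than that.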
# Exercise 7.2: ROC Analyzer (Guided)
def analyze_roc(y_true: pd.Series, y_proba: np.ndarray,
target_fpr: float = 0.1) -> Dict:
"""
Comprehensive ROC analysis.
Args:
y_true: True labels
y_proba: Predicted probabilities (n_samples, 2)
target_fpr: Target false positive rate for threshold
Returns:
Dictionary with ROC analysis results
"""
# TODO: Calculate ROC curve
fpr, tpr, thresholds = ______(y_true, y_proba[:, 1])
# TODO: Calculate AUC
roc_auc = ______(fpr, tpr)
# Youden's optimal threshold
j_scores = tpr - fpr
optimal_idx = np.______(j_scores)
optimal_threshold = thresholds[optimal_idx]
# Threshold for target FPR
target_idx = np.argmin(np.abs(fpr - target_fpr))
target_threshold = thresholds[target_idx]
return {
'auc': roc_auc,
'optimal_threshold': optimal_threshold,
'optimal_tpr': tpr[optimal_idx],
'optimal_fpr': fpr[optimal_idx],
'target_threshold': target_threshold,
'target_tpr': tpr[target_idx],
'fpr': fpr,
'tpr': tpr,
'thresholds': thresholds
}
# Test the function
# roc_analysis = analyze_roc(y_test, y_proba)
Solution 7.2
def analyze_roc(y_true: pd.Series, y_proba: np.ndarray,
target_fpr: float = 0.1) -> Dict:
"""
Comprehensive ROC analysis.
"""
fpr, tpr, thresholds = roc_curve(y_true, y_proba[:, 1])
roc_auc = auc(fpr, tpr)
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]
target_idx = np.argmin(np.abs(fpr - target_fpr))
target_threshold = thresholds[target_idx]
return {
'auc': roc_auc,
'optimal_threshold': optimal_threshold,
'optimal_tpr': tpr[optimal_idx],
'optimal_fpr': fpr[optimal_idx],
'target_threshold': target_threshold,
'target_tpr': tpr[target_idx],
'fpr': fpr,
'tpr': tpr,
'thresholds': thresholds
}
Section 3: Financial Metrics
ML metrics alone don't tell the whole story; financial performance metrics show whether predictions actually translate into profits.
# Financial Metrics Overview
financial_metrics = """
FINANCIAL METRICS FOR ML MODELS
================================
Why Financial Metrics Matter:
-----------------------------
- High accuracy doesn't mean profits
- Correct predictions on small moves vs wrong on big moves
- Transaction costs and slippage
- Risk-adjusted returns matter
Key Financial Metrics:
----------------------
1. Total Return
Strategy return vs buy-and-hold
2. Sharpe Ratio
(Return - Risk Free) / Volatility
> 1.0 is good, > 2.0 is excellent
3. Maximum Drawdown
Largest peak-to-trough decline
Lower is better
4. Win Rate
Percentage of profitable trades
(Different from ML accuracy!)
5. Profit Factor
Gross Profit / Gross Loss
> 1.0 means profitable
6. Average Win/Loss Ratio
Average winning trade / Average losing trade
7. Calmar Ratio
Annual Return / Max Drawdown
Accuracy vs Profitability:
--------------------------
Model A: 60% accuracy, predicts small moves correctly
Model B: 45% accuracy, predicts big moves correctly
Model B can be MORE profitable!
"""
print(financial_metrics)
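To make the formulas above concrete before applying them to model output, here is a minimal sketch computing Sharpe ratio, maximum drawdown, and profit factor on a short made-up daily return series (numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Illustrative daily strategy returns (not real data).
rets = pd.Series([0.01, -0.02, 0.015, 0.005, -0.01, 0.02])

# Annualized Sharpe ratio (risk-free rate assumed 0 for simplicity).
sharpe = np.sqrt(252) * rets.mean() / rets.std()

# Maximum drawdown: largest peak-to-trough decline of the equity curve.
equity = (1 + rets).cumprod()
drawdown = (equity - equity.cummax()) / equity.cummax()
max_dd = drawdown.min()

# Profit factor: gross profit divided by gross loss.
profit_factor = rets[rets > 0].sum() / -rets[rets < 0].sum()

print(f"Sharpe={sharpe:.2f}  MaxDD={max_dd:.2%}  PF={profit_factor:.2f}")
```

Here the gross profit is 0.05 and the gross loss 0.03, giving a profit factor of about 1.67, and the deepest drawdown (-2%) comes right after the first peak.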
# Calculate financial metrics for our model
def calculate_financial_metrics(y_true: pd.Series, y_pred: np.ndarray,
returns: pd.Series, risk_free: float = 0.02) -> Dict:
"""
Calculate financial performance metrics.
Args:
y_true: True labels
y_pred: Predicted labels
returns: Actual returns series (aligned with predictions)
risk_free: Annual risk-free rate
"""
# Align data
pred_series = pd.Series(y_pred, index=y_true.index)
aligned_returns = returns.loc[y_true.index]
# Strategy returns (long when predicted up, flat when predicted down)
strategy_returns = pred_series.shift(1) * aligned_returns
strategy_returns = strategy_returns.dropna()
# Buy and hold returns
bh_returns = aligned_returns
# Cumulative returns
cum_strategy = (1 + strategy_returns).cumprod().iloc[-1] - 1
cum_bh = (1 + bh_returns).cumprod().iloc[-1] - 1
# Sharpe Ratio (annualized)
daily_rf = (1 + risk_free) ** (1/252) - 1
excess_returns = strategy_returns - daily_rf
sharpe = np.sqrt(252) * excess_returns.mean() / excess_returns.std()
# Maximum Drawdown
cum_returns = (1 + strategy_returns).cumprod()
running_max = cum_returns.expanding().max()
drawdown = (cum_returns - running_max) / running_max
max_drawdown = drawdown.min()
# Win Rate (on actual trades)
trades = strategy_returns[pred_series.shift(1) == 1]
win_rate = (trades > 0).mean() if len(trades) > 0 else 0
# Profit Factor
gains = trades[trades > 0].sum()
losses = abs(trades[trades < 0].sum())
profit_factor = gains / losses if losses > 0 else np.inf
return {
'total_return': cum_strategy,
'bh_return': cum_bh,
'outperformance': cum_strategy - cum_bh,
'sharpe_ratio': sharpe,
'max_drawdown': max_drawdown,
'win_rate': win_rate,
'profit_factor': profit_factor,
'n_trades': len(trades)
}
# Calculate
# Same-day returns: calculate_financial_metrics lags predictions by one day,
# so shifting the returns here as well would double-lag the signal
returns = df['Close'].pct_change()
test_returns = returns.loc[y_test.index]
financial = calculate_financial_metrics(y_test, y_pred, test_returns)
print("Financial Performance Metrics:")
print(f" Total Return: {financial['total_return']:.2%}")
print(f" Buy & Hold: {financial['bh_return']:.2%}")
print(f" Outperformance: {financial['outperformance']:.2%}")
print(f"\n Sharpe Ratio: {financial['sharpe_ratio']:.2f}")
print(f" Max Drawdown: {financial['max_drawdown']:.2%}")
print(f"\n Win Rate: {financial['win_rate']:.2%}")
print(f" Profit Factor: {financial['profit_factor']:.2f}")
print(f" Number of Trades: {financial['n_trades']}")
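The shift(1) alignment is easy to get wrong, so here is a self-contained toy example of the convention used in this module: the position held during day t comes from the prediction made at the close of day t-1, and it earns day t's return (values are made up):

```python
import pandas as pd

# Toy example of signal/return alignment (illustrative values).
idx = pd.date_range("2024-01-01", periods=5, freq="D")
same_day_returns = pd.Series([0.01, -0.02, 0.03, -0.01, 0.02], index=idx)
predictions = pd.Series([1, 0, 1, 1, 0], index=idx)  # 1 = long, 0 = flat

# The prediction made at the close of day t-1 earns day t's return.
strategy = predictions.shift(1) * same_day_returns

# Day 1: no prior prediction -> NaN. Day 2: pred(day 1)=1 earns -0.02.
# Day 3: pred(day 2)=0 earns 0. Day 4: pred(day 3)=1 earns -0.01.
print(strategy.dropna().tolist())  # [-0.02, 0.0, -0.01, 0.02]
```

Shifting the predictions (rather than shifting the returns) keeps the returns series in its natural same-day alignment, which is why the metric functions above expect unshifted returns.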
# Visualize strategy performance
# Calculate cumulative returns
pred_series = pd.Series(y_pred, index=y_test.index)
strategy_returns = pred_series.shift(1) * test_returns
strategy_returns = strategy_returns.dropna()
cum_strategy = (1 + strategy_returns).cumprod()
cum_bh = (1 + test_returns.loc[strategy_returns.index]).cumprod()
fig, axes = plt.subplots(2, 1, figsize=(14, 10))
# Cumulative returns
axes[0].plot(cum_strategy.index, cum_strategy, label='Strategy', linewidth=2)
axes[0].plot(cum_bh.index, cum_bh, label='Buy & Hold', linewidth=2, alpha=0.7)
axes[0].set_ylabel('Cumulative Return')
axes[0].set_title('Strategy vs Buy & Hold Performance')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Drawdown
running_max = cum_strategy.expanding().max()
drawdown = (cum_strategy - running_max) / running_max * 100
axes[1].fill_between(drawdown.index, drawdown, 0, color='red', alpha=0.3)
axes[1].plot(drawdown.index, drawdown, color='red', linewidth=1)
axes[1].set_ylabel('Drawdown (%)')
axes[1].set_title('Strategy Drawdown')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Exercise 7.3: Complete Financial Evaluator (Guided)
def evaluate_trading_strategy(predictions: np.ndarray, returns: pd.Series,
index: pd.DatetimeIndex) -> pd.DataFrame:
"""
Evaluate trading strategy with comprehensive metrics.
Returns:
DataFrame with daily performance and summary statistics
"""
# Create aligned series
pred_series = pd.Series(predictions, index=index)
aligned_returns = returns.loc[index]
# TODO: Calculate strategy returns
strategy_returns = pred_series.______(1) * aligned_returns
# Build results DataFrame
results = pd.DataFrame(index=index)
results['prediction'] = pred_series
results['actual_return'] = aligned_returns
results['strategy_return'] = strategy_returns
# TODO: Calculate cumulative returns
results['cum_strategy'] = (1 + results['strategy_return'].fillna(0)).______()
results['cum_bh'] = (1 + results['actual_return'].fillna(0)).______()
# Drawdown calculation
running_max = results['cum_strategy'].expanding().max()
results['drawdown'] = (results['cum_strategy'] - running_max) / running_max
return results
# Test the function
# eval_results = evaluate_trading_strategy(y_pred, test_returns, y_test.index)
Solution 7.3
def evaluate_trading_strategy(predictions: np.ndarray, returns: pd.Series,
index: pd.DatetimeIndex) -> pd.DataFrame:
"""
Evaluate trading strategy with comprehensive metrics.
"""
pred_series = pd.Series(predictions, index=index)
aligned_returns = returns.loc[index]
strategy_returns = pred_series.shift(1) * aligned_returns
results = pd.DataFrame(index=index)
results['prediction'] = pred_series
results['actual_return'] = aligned_returns
results['strategy_return'] = strategy_returns
results['cum_strategy'] = (1 + results['strategy_return'].fillna(0)).cumprod()
results['cum_bh'] = (1 + results['actual_return'].fillna(0)).cumprod()
running_max = results['cum_strategy'].expanding().max()
results['drawdown'] = (results['cum_strategy'] - running_max) / running_max
return results
Section 4: Walk-Forward Validation
Proper time-series validation for realistic performance assessment.
# Walk-Forward Validation Concepts
wf_concepts = """
WALK-FORWARD VALIDATION
=======================
Why Walk-Forward?
-----------------
- Standard k-fold CV uses future data to predict past (leakage!)
- Time series requires respecting temporal order
- Simulates real trading: train on past, predict future
Types of Time Series CV:
------------------------
1. Expanding Window:
Train: [====] → Test: [=]
Train: [=====] → Test: [=]
Train: [======] → Test: [=]
2. Rolling Window (Fixed):
Train: [====] → Test: [=]
Train: [====] → Test: [=]
Train: [====] → Test: [=]
3. Purging & Embargo:
Train: [====]___Gap___Test: [=]
- Purge: drop training samples whose labels overlap the test window
- Embargo: leave an extra gap so serially correlated features can't leak
- Both prevent label leakage when labels span multiple days
Walk-Forward Process:
---------------------
1. Define initial training window
2. Train model on training window
3. Make predictions on test window
4. Roll forward by step size
5. Repeat until end of data
6. Aggregate all out-of-sample predictions
"""
print(wf_concepts)
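For the expanding-window variant, recent scikit-learn versions ship a ready-made splitter, TimeSeriesSplit, which is a useful cross-check for the hand-rolled validator below (a minimal sketch on a dummy feature array):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Dummy data: 100 "days" of a single feature.
X = np.arange(100).reshape(-1, 1)

# Expanding-window splits: the training set grows each fold, and test
# windows never precede their training data. gap= adds an embargo.
tscv = TimeSeriesSplit(n_splits=4, test_size=20, gap=0)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train [{train_idx[0]}..{train_idx[-1]}], "
          f"test [{test_idx[0]}..{test_idx[-1]}]")
```

Unlike the rolling-window validator implemented next, TimeSeriesSplit always expands the training set; use whichever matches how the model would be retrained in production.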
# Walk-Forward Validation Implementation
class WalkForwardValidator:
"""Walk-forward validation for time series ML."""
def __init__(self, train_size: int = 252, test_size: int = 21,
step_size: int = 21, purge_size: int = 0):
"""
Args:
train_size: Number of days in training window
test_size: Number of days in test window
step_size: Number of days to step forward
purge_size: Number of days to purge between train and test
"""
self.train_size = train_size
self.test_size = test_size
self.step_size = step_size
self.purge_size = purge_size
def split(self, X: pd.DataFrame) -> List[Tuple[np.ndarray, np.ndarray]]:
"""Generate train/test splits."""
n = len(X)
splits = []
start = 0
while start + self.train_size + self.purge_size + self.test_size <= n:
train_end = start + self.train_size
test_start = train_end + self.purge_size
test_end = test_start + self.test_size
train_idx = np.arange(start, train_end)
test_idx = np.arange(test_start, test_end)
splits.append((train_idx, test_idx))
start += self.step_size
return splits
def validate(self, model, X: pd.DataFrame, y: pd.Series,
scaler=None) -> Dict:
"""Run walk-forward validation."""
splits = self.split(X)
all_predictions = []
all_actuals = []
all_probas = []
fold_metrics = []
for i, (train_idx, test_idx) in enumerate(splits):
X_train = X.iloc[train_idx]
X_test = X.iloc[test_idx]
y_train = y.iloc[train_idx]
y_test = y.iloc[test_idx]
# Scale if scaler provided
if scaler:
scaler_fold = scaler.__class__()
X_train = scaler_fold.fit_transform(X_train)
X_test = scaler_fold.transform(X_test)
# Train and predict
model_fold = model.__class__(**model.get_params())
model_fold.fit(X_train, y_train)
pred = model_fold.predict(X_test)
proba = model_fold.predict_proba(X_test)
all_predictions.extend(pred)
all_actuals.extend(y_test.values)
all_probas.extend(proba[:, 1])
fold_metrics.append({
'fold': i,
'train_start': X.index[train_idx[0]],
'train_end': X.index[train_idx[-1]],
'test_start': X.index[test_idx[0]],
'test_end': X.index[test_idx[-1]],
'accuracy': accuracy_score(y_test, pred),
'auc': roc_auc_score(y_test, proba[:, 1])
})
return {
'predictions': np.array(all_predictions),
'actuals': np.array(all_actuals),
'probas': np.array(all_probas),
'fold_metrics': pd.DataFrame(fold_metrics)
}
# Run walk-forward validation
X_full = pd.concat([X_train, X_test])
y_full = pd.concat([y_train, y_test])
wf = WalkForwardValidator(train_size=200, test_size=20, step_size=20)
wf_results = wf.validate(model, X_full, y_full, scaler=StandardScaler())
print(f"Walk-Forward Validation:")
print(f" Total folds: {len(wf_results['fold_metrics'])}")
print(f" Total predictions: {len(wf_results['predictions'])}")
print(f"\n Overall Accuracy: {accuracy_score(wf_results['actuals'], wf_results['predictions']):.4f}")
print(f" Overall AUC: {roc_auc_score(wf_results['actuals'], wf_results['probas']):.4f}")
# Visualize walk-forward results
fold_df = wf_results['fold_metrics']
fig, axes = plt.subplots(2, 1, figsize=(14, 8))
# Accuracy across folds
axes[0].bar(fold_df['fold'], fold_df['accuracy'], color='steelblue', alpha=0.7)
axes[0].axhline(y=0.5, color='red', linestyle='--', label='Random')
axes[0].axhline(y=fold_df['accuracy'].mean(), color='green', linestyle='-',
label=f'Mean: {fold_df["accuracy"].mean():.2%}')
axes[0].set_xlabel('Fold')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Walk-Forward Accuracy by Fold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# AUC across folds
axes[1].bar(fold_df['fold'], fold_df['auc'], color='forestgreen', alpha=0.7)
axes[1].axhline(y=0.5, color='red', linestyle='--', label='Random')
axes[1].axhline(y=fold_df['auc'].mean(), color='blue', linestyle='-',
label=f'Mean: {fold_df["auc"].mean():.4f}')
axes[1].set_xlabel('Fold')
axes[1].set_ylabel('AUC')
axes[1].set_title('Walk-Forward AUC by Fold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nFold-level Metrics Summary:")
print(f" Accuracy: {fold_df['accuracy'].mean():.4f} +/- {fold_df['accuracy'].std():.4f}")
print(f" AUC: {fold_df['auc'].mean():.4f} +/- {fold_df['auc'].std():.4f}")
# Exercise 7.4: Complete Model Evaluator (Open-ended)
#
# Build a ModelEvaluator class that:
# - Calculates all classification metrics (accuracy, precision, recall, F1, AUC)
# - Calculates financial metrics (return, Sharpe, drawdown, win rate)
# - Supports walk-forward validation
# - Generates a comprehensive report
# - Creates visualization plots (ROC, confusion matrix, returns)
#
# Your implementation:
Solution 7.4
class ModelEvaluator:
"""Comprehensive model evaluation for trading ML."""
def __init__(self, model, X_train, y_train, X_test, y_test, returns):
self.model = model
self.X_train = X_train
self.y_train = y_train
self.X_test = X_test
self.y_test = y_test
self.returns = returns
# Predictions
self.y_pred = model.predict(X_test)
self.y_proba = model.predict_proba(X_test)
def get_classification_metrics(self) -> Dict:
"""Calculate all classification metrics."""
return {
'accuracy': accuracy_score(self.y_test, self.y_pred),
'precision': precision_score(self.y_test, self.y_pred),
'recall': recall_score(self.y_test, self.y_pred),
'f1': f1_score(self.y_test, self.y_pred),
'roc_auc': roc_auc_score(self.y_test, self.y_proba[:, 1])
}
def get_financial_metrics(self) -> Dict:
"""Calculate financial performance metrics."""
test_returns = self.returns.loc[self.y_test.index]
pred_series = pd.Series(self.y_pred, index=self.y_test.index)
strategy_returns = pred_series.shift(1) * test_returns
strategy_returns = strategy_returns.dropna()
cum_return = (1 + strategy_returns).cumprod().iloc[-1] - 1
sharpe = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std()
cum_rets = (1 + strategy_returns).cumprod()
running_max = cum_rets.expanding().max()
max_dd = ((cum_rets - running_max) / running_max).min()
trades = strategy_returns[pred_series.shift(1) == 1]
win_rate = (trades > 0).mean() if len(trades) > 0 else 0
return {
'total_return': cum_return,
'sharpe_ratio': sharpe,
'max_drawdown': max_dd,
'win_rate': win_rate,
'n_trades': len(trades)
}
def walk_forward_validate(self, train_size: int = 200,
test_size: int = 20) -> Dict:
"""Run walk-forward validation."""
X_full = pd.concat([self.X_train, self.X_test])
y_full = pd.concat([self.y_train, self.y_test])
wf = WalkForwardValidator(train_size, test_size, test_size)
return wf.validate(self.model, X_full, y_full, StandardScaler())
def generate_report(self) -> pd.DataFrame:
"""Generate comprehensive report."""
clf_metrics = self.get_classification_metrics()
fin_metrics = self.get_financial_metrics()
all_metrics = {**clf_metrics, **fin_metrics}
return pd.DataFrame({
'Metric': list(all_metrics.keys()),
'Value': [f'{v:.4f}' if isinstance(v, float) else str(v)
for v in all_metrics.values()]
})
def plot_all(self):
"""Generate all visualization plots."""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# ROC Curve
fpr, tpr, _ = roc_curve(self.y_test, self.y_proba[:, 1])
axes[0, 0].plot(fpr, tpr, 'b-', lw=2)
axes[0, 0].plot([0, 1], [0, 1], 'r--')
axes[0, 0].set_title(f'ROC Curve (AUC={auc(fpr, tpr):.3f})')
axes[0, 0].set_xlabel('FPR')
axes[0, 0].set_ylabel('TPR')
# Confusion Matrix
cm = confusion_matrix(self.y_test, self.y_pred)
sns.heatmap(cm, annot=True, fmt='d', ax=axes[0, 1], cmap='Blues')
axes[0, 1].set_title('Confusion Matrix')
# Cumulative Returns
test_returns = self.returns.loc[self.y_test.index]
pred_series = pd.Series(self.y_pred, index=self.y_test.index)
strategy_rets = (pred_series.shift(1) * test_returns).fillna(0)
cum_strategy = (1 + strategy_rets).cumprod()
cum_bh = (1 + test_returns.fillna(0)).cumprod()
axes[1, 0].plot(cum_strategy, label='Strategy')
axes[1, 0].plot(cum_bh, label='Buy & Hold', alpha=0.7)
axes[1, 0].set_title('Cumulative Returns')
axes[1, 0].legend()
# Metrics Summary
report = self.generate_report()
axes[1, 1].axis('off')
table = axes[1, 1].table(
cellText=report.values,
colLabels=report.columns,
loc='center',
cellLoc='left'
)
table.auto_set_font_size(False)
table.set_fontsize(10)
axes[1, 1].set_title('Performance Metrics')
plt.tight_layout()
plt.show()
# Exercise 7.5: Threshold Optimizer (Open-ended)
#
# Build a ThresholdOptimizer class that:
# - Takes predicted probabilities and actual labels
# - Finds optimal threshold for different objectives:
# - Maximize accuracy
# - Maximize F1 score
# - Maximize financial returns
# - Balance precision and recall
# - Visualizes trade-offs at different thresholds
# - Returns recommendations with reasoning
#
# Your implementation:
Solution 7.5
class ThresholdOptimizer:
"""Optimize classification threshold for different objectives."""
def __init__(self, y_true: pd.Series, y_proba: np.ndarray, returns: pd.Series = None):
self.y_true = y_true
self.y_proba = y_proba[:, 1] if y_proba.ndim > 1 else y_proba
self.returns = returns
self.thresholds = np.linspace(0.01, 0.99, 99)
def _evaluate_threshold(self, threshold: float) -> Dict:
"""Evaluate metrics at given threshold."""
y_pred = (self.y_proba >= threshold).astype(int)
metrics = {
'threshold': threshold,
'accuracy': accuracy_score(self.y_true, y_pred),
'precision': precision_score(self.y_true, y_pred, zero_division=0),
'recall': recall_score(self.y_true, y_pred, zero_division=0),
'f1': f1_score(self.y_true, y_pred, zero_division=0),
'n_predictions': sum(y_pred)
}
if self.returns is not None:
pred_series = pd.Series(y_pred, index=self.y_true.index)
strat_rets = (pred_series.shift(1) * self.returns.loc[self.y_true.index]).dropna()
metrics['total_return'] = (1 + strat_rets).cumprod().iloc[-1] - 1 if len(strat_rets) > 0 else 0
return metrics
def optimize(self, objective: str = 'f1') -> Dict:
"""Find optimal threshold for given objective."""
results = [self._evaluate_threshold(t) for t in self.thresholds]
df = pd.DataFrame(results)
if objective in df.columns:
best_idx = df[objective].idxmax()
return {
'optimal_threshold': df.loc[best_idx, 'threshold'],
'optimal_value': df.loc[best_idx, objective],
'metrics_at_optimal': df.loc[best_idx].to_dict(),
'all_results': df
}
else:
raise ValueError(f"Unknown objective: {objective}")
def plot_tradeoffs(self):
"""Visualize metrics across thresholds."""
results = [self._evaluate_threshold(t) for t in self.thresholds]
df = pd.DataFrame(results)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Accuracy and F1
axes[0, 0].plot(df['threshold'], df['accuracy'], label='Accuracy')
axes[0, 0].plot(df['threshold'], df['f1'], label='F1')
axes[0, 0].set_xlabel('Threshold')
axes[0, 0].set_title('Accuracy and F1 vs Threshold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Precision and Recall
axes[0, 1].plot(df['threshold'], df['precision'], label='Precision')
axes[0, 1].plot(df['threshold'], df['recall'], label='Recall')
axes[0, 1].set_xlabel('Threshold')
axes[0, 1].set_title('Precision and Recall vs Threshold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Number of predictions
axes[1, 0].plot(df['threshold'], df['n_predictions'])
axes[1, 0].set_xlabel('Threshold')
axes[1, 0].set_ylabel('Number of Positive Predictions')
axes[1, 0].set_title('Trade Frequency vs Threshold')
axes[1, 0].grid(True, alpha=0.3)
# Returns if available
if 'total_return' in df.columns:
axes[1, 1].plot(df['threshold'], df['total_return'])
axes[1, 1].set_xlabel('Threshold')
axes[1, 1].set_ylabel('Total Return')
axes[1, 1].set_title('Returns vs Threshold')
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def recommend(self) -> str:
"""Provide threshold recommendation."""
acc_opt = self.optimize('accuracy')
f1_opt = self.optimize('f1')
rec = f"""Threshold Recommendations:
1. For Maximum Accuracy: {acc_opt['optimal_threshold']:.3f}
Accuracy: {acc_opt['optimal_value']:.4f}
2. For Maximum F1: {f1_opt['optimal_threshold']:.3f}
F1: {f1_opt['optimal_value']:.4f}"""
if self.returns is not None:
ret_opt = self.optimize('total_return')
rec += f"""
3. For Maximum Returns: {ret_opt['optimal_threshold']:.3f}
Return: {ret_opt['optimal_value']:.4f}"""
return rec
# Exercise 7.6: Model Comparison Dashboard (Open-ended)
#
# Build a ModelComparisonDashboard class that:
# - Takes multiple trained models
# - Compares them on all metrics (ML and financial)
# - Generates comparison tables and plots
# - Ranks models by different criteria
# - Provides a final recommendation
# - Exports results to a report
#
# Your implementation:
Solution 7.6
class ModelComparisonDashboard:
"""Compare multiple models comprehensively."""
def __init__(self, models: Dict, X_train, y_train, X_test, y_test, returns):
self.models = models
self.X_train = X_train
self.y_train = y_train
self.X_test = X_test
self.y_test = y_test
self.returns = returns
self.results = {}
def evaluate_all(self):
"""Evaluate all models."""
for name, model in self.models.items():
# Get predictions
y_pred = model.predict(self.X_test)
y_proba = model.predict_proba(self.X_test)
# ML metrics
ml_metrics = {
'accuracy': accuracy_score(self.y_test, y_pred),
'precision': precision_score(self.y_test, y_pred),
'recall': recall_score(self.y_test, y_pred),
'f1': f1_score(self.y_test, y_pred),
'auc': roc_auc_score(self.y_test, y_proba[:, 1])
}
# Financial metrics
test_returns = self.returns.loc[self.y_test.index]
pred_series = pd.Series(y_pred, index=self.y_test.index)
strat_rets = (pred_series.shift(1) * test_returns).dropna()
fin_metrics = {
'total_return': (1 + strat_rets).cumprod().iloc[-1] - 1,
'sharpe': np.sqrt(252) * strat_rets.mean() / strat_rets.std() if strat_rets.std() > 0 else 0,
'win_rate': (strat_rets[pred_series.shift(1) == 1] > 0).mean()
}
self.results[name] = {**ml_metrics, **fin_metrics}
return self
def get_comparison_table(self) -> pd.DataFrame:
"""Get comparison table."""
return pd.DataFrame(self.results).T
def rank_models(self, by: str = 'f1') -> pd.DataFrame:
"""Rank models by specific metric."""
df = self.get_comparison_table()
return df.sort_values(by, ascending=False)
def plot_comparison(self):
"""Plot model comparison."""
df = self.get_comparison_table()
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Accuracy comparison
df['accuracy'].plot(kind='bar', ax=axes[0, 0], color='steelblue')
axes[0, 0].set_title('Accuracy')
axes[0, 0].axhline(y=0.5, color='red', linestyle='--')
# AUC comparison
df['auc'].plot(kind='bar', ax=axes[0, 1], color='forestgreen')
axes[0, 1].set_title('AUC')
axes[0, 1].axhline(y=0.5, color='red', linestyle='--')
# Returns comparison
df['total_return'].plot(kind='bar', ax=axes[1, 0], color='darkorange')
axes[1, 0].set_title('Total Return')
axes[1, 0].axhline(y=0, color='gray', linestyle='--')
# Sharpe comparison
df['sharpe'].plot(kind='bar', ax=axes[1, 1], color='purple')
axes[1, 1].set_title('Sharpe Ratio')
axes[1, 1].axhline(y=0, color='gray', linestyle='--')
for ax in axes.flat:
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
def recommend(self) -> str:
"""Recommend best model."""
df = self.get_comparison_table()
best_ml = df['f1'].idxmax()
best_financial = df['total_return'].idxmax()
# Score models on normalized metrics
normalized = (df - df.min()) / (df.max() - df.min())
combined_score = normalized.mean(axis=1)
best_overall = combined_score.idxmax()
return f"""Model Recommendations:
Best ML Performance: {best_ml}
F1 Score: {df.loc[best_ml, 'f1']:.4f}
AUC: {df.loc[best_ml, 'auc']:.4f}
Best Financial Performance: {best_financial}
Total Return: {df.loc[best_financial, 'total_return']:.4f}
Sharpe Ratio: {df.loc[best_financial, 'sharpe']:.2f}
Best Overall (Balanced): {best_overall}
Combined Score: {combined_score[best_overall]:.4f}"""
def export_report(self, filepath: str = 'model_comparison.csv'):
"""Export comparison to CSV."""
df = self.get_comparison_table()
df.to_csv(filepath)
print(f"Report exported to {filepath}")
Module Project: Complete Model Evaluation Pipeline
Build a comprehensive evaluation system that combines all concepts.
class MLTradingEvaluator:
"""
Complete evaluation pipeline for ML trading models.
Combines classification metrics, financial metrics,
walk-forward validation, and threshold optimization.
"""
def __init__(self, model):
self.model = model
self.scaler = StandardScaler()
self.evaluation_results = {}
def prepare_data(self, df: pd.DataFrame, test_size: float = 0.2):
"""Prepare features and split data."""
# Features
features = pd.DataFrame(index=df.index)
features['returns'] = df['Close'].pct_change()
features['volatility'] = features['returns'].rolling(20).std()
for p in [5, 10, 20]:
features[f'momentum_{p}'] = df['Close'].pct_change(p)
for p in [5, 20, 50]:
ma = df['Close'].rolling(p).mean()
features[f'dist_ma{p}'] = (df['Close'] - ma) / ma
delta = df['Close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
features['rsi'] = 100 - (100 / (1 + gain / loss))
features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
# Target and returns
features['target'] = (features['returns'].shift(-1) > 0).astype(int)
features['next_return'] = features['returns'].shift(-1)
features = features.dropna()
# Feature columns
feature_cols = [c for c in features.columns if c not in ['target', 'next_return']]
# Split
split_idx = int(len(features) * (1 - test_size))
self.X_train = features[feature_cols][:split_idx]
self.X_test = features[feature_cols][split_idx:]
self.y_train = features['target'][:split_idx]
self.y_test = features['target'][split_idx:]
self.returns = features['next_return']
return self
def train_and_predict(self):
"""Train model and get predictions."""
X_train_scaled = self.scaler.fit_transform(self.X_train)
X_test_scaled = self.scaler.transform(self.X_test)
self.model.fit(X_train_scaled, self.y_train)
self.y_pred = self.model.predict(X_test_scaled)
self.y_proba = self.model.predict_proba(X_test_scaled)
return self
def evaluate_classification(self) -> Dict:
"""Calculate classification metrics."""
metrics = {
'accuracy': accuracy_score(self.y_test, self.y_pred),
'precision': precision_score(self.y_test, self.y_pred),
'recall': recall_score(self.y_test, self.y_pred),
'f1': f1_score(self.y_test, self.y_pred),
'roc_auc': roc_auc_score(self.y_test, self.y_proba[:, 1])
}
self.evaluation_results['classification'] = metrics
return metrics
def evaluate_financial(self) -> Dict:
"""Calculate financial metrics."""
test_returns = self.returns.loc[self.y_test.index]
pred_series = pd.Series(self.y_pred, index=self.y_test.index)
# next_return is already shifted forward in prepare_data, so the day-t
# prediction pairs directly with its next-day return -- no extra lag needed
strategy_returns = (pred_series * test_returns).dropna()
cum_return = (1 + strategy_returns).cumprod().iloc[-1] - 1
cum_bh = (1 + test_returns.loc[strategy_returns.index]).cumprod().iloc[-1] - 1
sharpe = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std()
cum_rets = (1 + strategy_returns).cumprod()
running_max = cum_rets.expanding().max()
max_dd = ((cum_rets - running_max) / running_max).min()
trades = strategy_returns[pred_series == 1]
win_rate = (trades > 0).mean() if len(trades) > 0 else 0
metrics = {
'total_return': cum_return,
'buy_hold_return': cum_bh,
'outperformance': cum_return - cum_bh,
'sharpe_ratio': sharpe,
'max_drawdown': max_dd,
'win_rate': win_rate,
'n_trades': len(trades)
}
self.evaluation_results['financial'] = metrics
return metrics
def run_full_evaluation(self) -> pd.DataFrame:
"""Run complete evaluation."""
clf_metrics = self.evaluate_classification()
fin_metrics = self.evaluate_financial()
all_metrics = []
for name, value in clf_metrics.items():
all_metrics.append({'Category': 'Classification', 'Metric': name, 'Value': f'{value:.4f}'})
for name, value in fin_metrics.items():
if isinstance(value, float):
all_metrics.append({'Category': 'Financial', 'Metric': name, 'Value': f'{value:.4f}'})
else:
all_metrics.append({'Category': 'Financial', 'Metric': name, 'Value': str(value)})
return pd.DataFrame(all_metrics)
def plot_evaluation(self):
"""Create evaluation visualization."""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# ROC Curve
fpr, tpr, _ = roc_curve(self.y_test, self.y_proba[:, 1])
axes[0, 0].plot(fpr, tpr, 'b-', lw=2, label=f'AUC = {auc(fpr, tpr):.3f}')
axes[0, 0].plot([0, 1], [0, 1], 'r--')
axes[0, 0].set_title('ROC Curve')
axes[0, 0].set_xlabel('False Positive Rate')
axes[0, 0].set_ylabel('True Positive Rate')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Confusion Matrix
cm = confusion_matrix(self.y_test, self.y_pred)
sns.heatmap(cm, annot=True, fmt='d', ax=axes[0, 1], cmap='Blues',
xticklabels=['DOWN', 'UP'], yticklabels=['DOWN', 'UP'])
axes[0, 1].set_title('Confusion Matrix')
axes[0, 1].set_xlabel('Predicted')
axes[0, 1].set_ylabel('Actual')
# Cumulative Returns
test_returns = self.returns.loc[self.y_test.index]
pred_series = pd.Series(self.y_pred, index=self.y_test.index)
# next_return is already forward-shifted, so no extra lag is applied
strategy_rets = (pred_series * test_returns).fillna(0)
cum_strategy = (1 + strategy_rets).cumprod()
cum_bh = (1 + test_returns.fillna(0)).cumprod()
axes[1, 0].plot(cum_strategy.index, cum_strategy, label='Strategy', lw=2)
axes[1, 0].plot(cum_bh.index, cum_bh, label='Buy & Hold', lw=2, alpha=0.7)
axes[1, 0].set_title('Cumulative Returns')
axes[1, 0].set_ylabel('Cumulative Return')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Metrics Summary
metrics_text = []
for cat, mets in self.evaluation_results.items():
metrics_text.append(f"\n{cat.upper()}:")
for k, v in mets.items():
if isinstance(v, float):
metrics_text.append(f" {k}: {v:.4f}")
else:
metrics_text.append(f" {k}: {v}")
axes[1, 1].text(0.1, 0.9, '\n'.join(metrics_text), transform=axes[1, 1].transAxes,
fontsize=11, verticalalignment='top', fontfamily='monospace')
axes[1, 1].axis('off')
axes[1, 1].set_title('Evaluation Summary')
plt.tight_layout()
plt.show()
# Run the complete evaluation pipeline
# Get data
ticker = yf.Ticker("SPY")
data = ticker.history(period="2y")
# Create evaluator
evaluator = MLTradingEvaluator(
RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
)
# Run pipeline
evaluator.prepare_data(data)
evaluator.train_and_predict()
# Get full evaluation
report = evaluator.run_full_evaluation()
print("\nFull Evaluation Report:")
print(report.to_string(index=False))
# Visualize evaluation
evaluator.plot_evaluation()
Key Takeaways
- Accuracy alone is insufficient for evaluating trading models; always use precision, recall, F1, and AUC
- Confusion matrices reveal error patterns that summary metrics hide
- ROC curves and AUC provide threshold-independent model comparison
- Financial metrics (returns, Sharpe, drawdown) matter more than ML metrics for trading
- Walk-forward validation simulates real trading conditions and prevents overfitting
- Threshold optimization can significantly impact trading performance
- Compare multiple metrics across different objectives (ML vs financial) before selecting a model
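The threshold-optimization takeaway can be made concrete with a small sketch on synthetic data (the labels, probabilities, and the `f1_at_threshold` helper are all illustrative, not taken from the module's pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic classifier output (illustrative only): labels and
# weakly informative probabilities
y_true = rng.integers(0, 2, size=1000)
y_proba = np.clip(0.5 + 0.15 * (y_true - 0.5) + rng.normal(0, 0.15, size=1000), 0, 1)

def f1_at_threshold(y_true: np.ndarray, y_proba: np.ndarray, threshold: float) -> float:
    """F1 score when predicting 1 only above a probability threshold."""
    y_pred = (y_proba >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Sweep thresholds and keep the best one (use validation data in practice,
# never the test set, or the threshold itself becomes overfit)
thresholds = np.arange(0.30, 0.71, 0.05)
scores = [f1_at_threshold(y_true, y_proba, t) for t in thresholds]
best = float(thresholds[int(np.argmax(scores))])
print(f"best threshold: {best:.2f}  F1: {max(scores):.3f}")
```

The same sweep works for financial objectives: replace F1 with Sharpe ratio computed from thresholded positions.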
Next: Module 8 - Regression Models (Return prediction, volatility forecasting)
Module 8: Regression Models
Part 3: Advanced Techniques
| Duration | Exercises | Prerequisites |
|---|---|---|
| ~2.5 hours | 6 | Modules 1-7 |
Learning Objectives
By the end of this module, you will be able to:
- Apply regression models for return prediction
- Forecast volatility using various techniques
- Implement quantile regression for tail risk
- Use ensemble methods for regression
- Evaluate regression models with appropriate metrics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')
# Regression models
from sklearn.linear_model import (
LinearRegression, Ridge, Lasso, ElasticNet
)
from sklearn.ensemble import (
RandomForestRegressor, GradientBoostingRegressor
)
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.metrics import (
mean_squared_error, mean_absolute_error, r2_score
)
import yfinance as yf
print("Module 8: Regression Models")
print("=" * 40)
# Prepare regression data
def prepare_regression_data(symbol: str = "SPY", period: str = "2y") -> Tuple:
"""Prepare features and continuous target for regression."""
ticker = yf.Ticker(symbol)
df = ticker.history(period=period)
# Features
df['returns'] = df['Close'].pct_change()
df['volatility'] = df['returns'].rolling(20).std()
for p in [5, 10, 20]:
df[f'momentum_{p}'] = df['Close'].pct_change(p)
for p in [5, 20, 50]:
ma = df['Close'].rolling(p).mean()
df[f'dist_ma{p}'] = (df['Close'] - ma) / ma
delta = df['Close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
df['rsi'] = 100 - (100 / (1 + gain / loss))
df['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
# Continuous target: next day return
df['target_return'] = df['returns'].shift(-1)
# Alternative target: next 5-day return
df['target_5d_return'] = df['Close'].pct_change(5).shift(-5)
# Volatility target
df['target_volatility'] = df['volatility'].shift(-1)
df = df.dropna()
features = ['volatility', 'momentum_5', 'momentum_10', 'momentum_20',
'dist_ma5', 'dist_ma20', 'dist_ma50', 'rsi', 'volume_ratio']
return df, features
# Load data
df, feature_cols = prepare_regression_data()
print(f"Data shape: {df.shape}")
print(f"Features: {feature_cols}")
Section 1: Linear Regression Models
Linear models are simple, interpretable, and often surprisingly effective for financial prediction.
# Linear Regression Concepts
linear_concepts = """
LINEAR REGRESSION FOR FINANCE
=============================
Basic Model:
------------
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Where:
- y: Target (e.g., next day return)
- xᵢ: Features (momentum, volatility, etc.)
- βᵢ: Coefficients to learn
- ε: Error term
Regularization Types:
---------------------
1. Ridge (L2): Shrinks coefficients, handles multicollinearity
Loss = MSE + α * Σβᵢ²
2. Lasso (L1): Can zero out coefficients (feature selection)
Loss = MSE + α * Σ|βᵢ|
3. ElasticNet: Combination of L1 and L2
Loss = MSE + α * (r * Σ|βᵢ| + (1-r) * Σβᵢ²)
Advantages:
-----------
+ Interpretable coefficients
+ Fast to train and predict
+ Regularization prevents overfitting
Disadvantages:
--------------
- Assumes linear relationships
- May underfit complex patterns
- Sensitive to outliers
Financial Considerations:
-------------------------
- Returns are often nearly unpredictable (efficient markets)
- Small R² is normal (0.01-0.05 can be profitable)
- Coefficients show factor exposure
"""
print(linear_concepts)
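The Ridge loss above has a closed-form solution, which makes the shrinkage effect easy to see. A minimal NumPy sketch on synthetic data (the data, `ridge_closed_form`, and the alpha values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): 3 features, known coefficients
n, k = 500, 3
X = rng.normal(size=(n, k))
beta_true = np.array([0.5, -0.3, 0.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

def ridge_closed_form(X: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
    """Ridge solution: beta = (X'X + alpha*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Larger alpha shrinks the coefficient vector toward zero
for alpha in [0.0, 10.0, 1000.0]:
    beta = ridge_closed_form(X, y, alpha)
    print(f"alpha={alpha:7.1f}  ||beta||={np.linalg.norm(beta):.4f}")
```

This is what `sklearn.linear_model.Ridge` solves internally; the loop shows the norm of the coefficient vector falling as alpha grows, which is exactly the `α * Σβᵢ²` penalty at work.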
# Prepare data for regression
X = df[feature_cols]
y = df['target_return']
# Time series split
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Training: {len(X_train)}, Test: {len(X_test)}")
print(f"\nTarget Statistics:")
print(f" Mean: {y_train.mean():.6f}")
print(f" Std: {y_train.std():.6f}")
# Basic Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)
# Metrics
mse = mean_squared_error(y_test, y_pred_lr)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_lr)
r2 = r2_score(y_test, y_pred_lr)
print(f"Linear Regression Results:")
print(f" RMSE: {rmse:.6f}")
print(f" MAE: {mae:.6f}")
print(f" R²: {r2:.4f}")
# Coefficients
coef_df = pd.DataFrame({
'feature': feature_cols,
'coefficient': lr.coef_
}).sort_values('coefficient', key=abs, ascending=False)
print(f"\nFeature Coefficients:")
for _, row in coef_df.iterrows():
print(f" {row['feature']:15s}: {row['coefficient']:+.6f}")
# Compare regularization methods
models = {
'OLS': LinearRegression(),
'Ridge': Ridge(alpha=1.0),
'Lasso': Lasso(alpha=0.001),
'ElasticNet': ElasticNet(alpha=0.001, l1_ratio=0.5)
}
results = []
for name, model in models.items():
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
results.append({
'Model': name,
'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
'MAE': mean_absolute_error(y_test, y_pred),
'R2': r2_score(y_test, y_pred)
})
results_df = pd.DataFrame(results)
print("Linear Models Comparison:")
print(results_df.to_string(index=False))
# Exercise 8.1: Regularization Tuner (Guided)
def tune_ridge_alpha(X_train: np.ndarray, y_train: pd.Series,
alphas: List[float] = None,
cv_folds: int = 5) -> Dict:
"""
Tune Ridge regression alpha using time series CV.
Returns:
Dictionary with best alpha and cross-validation results
"""
if alphas is None:
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
# TODO: Create time series cross-validator
tscv = ______(n_splits=cv_folds)
results = []
best_score = float('-inf')
best_alpha = None
for alpha in alphas:
# TODO: Create Ridge model with current alpha
model = ______(alpha=______)
# TODO: Get cross-validation scores (negative MSE)
scores = ______(model, X_train, y_train, cv=tscv,
scoring='neg_mean_squared_error')
mean_score = scores.mean()
results.append({
'alpha': alpha,
'mean_neg_mse': mean_score,
'std_neg_mse': scores.std()
})
if mean_score > best_score:
best_score = mean_score
best_alpha = alpha
return {
'best_alpha': best_alpha,
'best_score': best_score,
'all_results': pd.DataFrame(results)
}
# Test the function
# ridge_tuning = tune_ridge_alpha(X_train_scaled, y_train)
Solution 8.1
def tune_ridge_alpha(X_train: np.ndarray, y_train: pd.Series,
alphas: List[float] = None,
cv_folds: int = 5) -> Dict:
"""
Tune Ridge regression alpha using time series CV.
"""
if alphas is None:
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
tscv = TimeSeriesSplit(n_splits=cv_folds)
results = []
best_score = float('-inf')
best_alpha = None
for alpha in alphas:
model = Ridge(alpha=alpha)
scores = cross_val_score(model, X_train, y_train, cv=tscv,
scoring='neg_mean_squared_error')
mean_score = scores.mean()
results.append({
'alpha': alpha,
'mean_neg_mse': mean_score,
'std_neg_mse': scores.std()
})
if mean_score > best_score:
best_score = mean_score
best_alpha = alpha
return {
'best_alpha': best_alpha,
'best_score': best_score,
'all_results': pd.DataFrame(results)
}
Section 2: Tree-Based Regression
Random Forest and Gradient Boosting for non-linear return prediction.
# Random Forest Regressor
rf_reg = RandomForestRegressor(
n_estimators=100,
max_depth=5,
min_samples_leaf=20,
random_state=42,
n_jobs=-1
)
rf_reg.fit(X_train_scaled, y_train)
y_pred_rf = rf_reg.predict(X_test_scaled)
print(f"Random Forest Regressor Results:")
print(f" RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_rf)):.6f}")
print(f" MAE: {mean_absolute_error(y_test, y_pred_rf):.6f}")
print(f" R²: {r2_score(y_test, y_pred_rf):.4f}")
# Feature importance
importance_df = pd.DataFrame({
'feature': feature_cols,
'importance': rf_reg.feature_importances_
}).sort_values('importance', ascending=False)
print(f"\nFeature Importance:")
for _, row in importance_df.iterrows():
print(f" {row['feature']:15s}: {row['importance']:.4f}")
# Gradient Boosting Regressor
gb_reg = GradientBoostingRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
min_samples_leaf=20,
subsample=0.8,
random_state=42
)
gb_reg.fit(X_train_scaled, y_train)
y_pred_gb = gb_reg.predict(X_test_scaled)
print(f"Gradient Boosting Regressor Results:")
print(f" RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_gb)):.6f}")
print(f" MAE: {mean_absolute_error(y_test, y_pred_gb):.6f}")
print(f" R²: {r2_score(y_test, y_pred_gb):.4f}")
# Exercise 8.2: Ensemble Regressor (Guided)
def create_ensemble_regressor(models: List, weights: List[float] = None) -> object:
"""
Create a weighted ensemble of regression models.
Returns:
Object with fit, predict methods
"""
class EnsembleRegressor:
def __init__(self, models, weights):
self.models = models
# TODO: Set weights (equal if not provided)
self.weights = weights if weights else [1/len(______)] * len(models)
def fit(self, X, y):
# TODO: Fit all models
for model in self.______:
model.______(X, y)
return self
def predict(self, X):
# TODO: Weighted average of predictions
predictions = np.zeros(len(X))
for model, weight in zip(self.models, self.weights):
predictions += ______ * model.______(X)
return predictions
return EnsembleRegressor(models, weights)
# Test the function
# ensemble = create_ensemble_regressor(
# [Ridge(), RandomForestRegressor(n_estimators=50, max_depth=5)],
# weights=[0.3, 0.7]
# )
Solution 8.2
def create_ensemble_regressor(models: List, weights: List[float] = None) -> object:
"""
Create a weighted ensemble of regression models.
"""
class EnsembleRegressor:
def __init__(self, models, weights):
self.models = models
self.weights = weights if weights else [1/len(models)] * len(models)
def fit(self, X, y):
for model in self.models:
model.fit(X, y)
return self
def predict(self, X):
predictions = np.zeros(len(X))
for model, weight in zip(self.models, self.weights):
predictions += weight * model.predict(X)
return predictions
return EnsembleRegressor(models, weights)
Section 3: Volatility Forecasting
Predicting volatility is often easier and more useful than predicting returns.
# Volatility Forecasting Concepts
vol_concepts = """
VOLATILITY FORECASTING
======================
Why Volatility?
---------------
- More predictable than returns
- Clusters (high vol followed by high vol)
- Critical for risk management
- Used in options pricing
Common Measures:
----------------
1. Historical Volatility
σ = std(returns) * sqrt(252)
2. Realized Volatility
RV = sqrt(Σ(intraday returns)²)
3. Range-Based (Parkinson)
σ = sqrt(ln(High/Low)² / (4*ln(2)))
ML Approaches:
--------------
- Predict next day/week volatility
- Use lagged volatility as key feature
- Often more successful than return prediction
Applications:
-------------
- Position sizing
- Risk budgeting
- Options trading
- VaR calculation
"""
print(vol_concepts)
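The clustering property listed above ("high vol followed by high vol") can be verified on simulated data: returns themselves are nearly uncorrelated, while their absolute values are not. A sketch assuming illustrative GARCH(1,1)-style parameters (not fitted values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate returns with GARCH(1,1)-style volatility clustering
# (omega, a, b are illustrative parameters, not fitted values)
n = 2000
omega, a, b = 1e-6, 0.10, 0.85
sigma2 = np.empty(n)
r = np.empty(n)
sigma2[0] = omega / (1 - a - b)  # unconditional variance
r[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
for t in range(1, n):
    sigma2[t] = omega + a * r[t - 1] ** 2 + b * sigma2[t - 1]
    r[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

def autocorr(x: np.ndarray, lag: int = 1) -> float:
    """Lag-k sample autocorrelation."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

print(f"lag-1 autocorr of returns:   {autocorr(r):+.3f}")          # near zero
print(f"lag-1 autocorr of |returns|: {autocorr(np.abs(r)):+.3f}")  # clearly positive
```

The positive autocorrelation of |returns| is why lagged volatility is such a strong feature in the models below, while lagged returns carry almost no signal.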
# Prepare volatility prediction data
def prepare_volatility_data(df: pd.DataFrame, vol_window: int = 20) -> Tuple:
"""Prepare features for volatility prediction."""
vol_df = pd.DataFrame(index=df.index)
# Current volatility (lagged features)
returns = df['Close'].pct_change()
vol_df['vol_20d'] = returns.rolling(vol_window).std()
vol_df['vol_5d'] = returns.rolling(5).std()
vol_df['vol_10d'] = returns.rolling(10).std()
# Volatility ratios
vol_df['vol_ratio_5_20'] = vol_df['vol_5d'] / vol_df['vol_20d']
# Range-based volatility (Parkinson)
vol_df['parkinson_vol'] = np.sqrt(
(np.log(df['High'] / df['Low']) ** 2).rolling(vol_window).mean() / (4 * np.log(2))
)
# Absolute returns
vol_df['abs_return_1d'] = returns.abs()
vol_df['abs_return_5d'] = returns.abs().rolling(5).mean()
# Volume features
vol_df['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
# Target: next day's volatility
vol_df['target_vol'] = vol_df['vol_20d'].shift(-1)
vol_df = vol_df.dropna()
features = ['vol_20d', 'vol_5d', 'vol_10d', 'vol_ratio_5_20',
'parkinson_vol', 'abs_return_1d', 'abs_return_5d', 'volume_ratio']
return vol_df[features], vol_df['target_vol']
# Prepare volatility data
X_vol, y_vol = prepare_volatility_data(df)
# Split
split_idx = int(len(X_vol) * 0.8)
X_vol_train, X_vol_test = X_vol[:split_idx], X_vol[split_idx:]
y_vol_train, y_vol_test = y_vol[:split_idx], y_vol[split_idx:]
# Scale
vol_scaler = StandardScaler()
X_vol_train_scaled = vol_scaler.fit_transform(X_vol_train)
X_vol_test_scaled = vol_scaler.transform(X_vol_test)
print(f"Volatility prediction data: {len(X_vol)} samples")
# Train volatility forecasting models
vol_models = {
'Ridge': Ridge(alpha=1.0),
'RF': RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42),
'GB': GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
}
print("Volatility Forecasting Results:")
print("=" * 50)
for name, model in vol_models.items():
model.fit(X_vol_train_scaled, y_vol_train)
y_pred = model.predict(X_vol_test_scaled)
rmse = np.sqrt(mean_squared_error(y_vol_test, y_pred))
mae = mean_absolute_error(y_vol_test, y_pred)
r2 = r2_score(y_vol_test, y_pred)
print(f"\n{name}:")
print(f" RMSE: {rmse:.6f}")
print(f" MAE: {mae:.6f}")
print(f" R²: {r2:.4f}")
# Visualize volatility predictions
# Use the best model (GB)
gb_vol = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
gb_vol.fit(X_vol_train_scaled, y_vol_train)
vol_pred = gb_vol.predict(X_vol_test_scaled)
# Plot
fig, axes = plt.subplots(2, 1, figsize=(14, 8))
# Actual vs Predicted
axes[0].plot(y_vol_test.index, y_vol_test.values, label='Actual', alpha=0.7)
axes[0].plot(y_vol_test.index, vol_pred, label='Predicted', alpha=0.7)
axes[0].set_ylabel('Volatility')
axes[0].set_title('Volatility Forecast: Actual vs Predicted')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Scatter plot
axes[1].scatter(y_vol_test, vol_pred, alpha=0.5)
axes[1].plot([y_vol_test.min(), y_vol_test.max()],
[y_vol_test.min(), y_vol_test.max()], 'r--', label='Perfect')
axes[1].set_xlabel('Actual Volatility')
axes[1].set_ylabel('Predicted Volatility')
axes[1].set_title('Prediction Scatter Plot')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Exercise 8.3: Volatility Forecaster (Guided)
class VolatilityForecaster:
"""
Multi-horizon volatility forecasting system.
"""
def __init__(self, horizons: List[int] = [1, 5, 20]):
self.horizons = horizons
self.models = {}
self.scalers = {}
def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create volatility features."""
features = pd.DataFrame(index=df.index)
returns = df['Close'].pct_change()
# TODO: Add volatility features for multiple windows
for window in [5, 10, 20, 60]:
features[f'vol_{window}d'] = returns.rolling(______).______()
# TODO: Add Parkinson volatility
log_hl = np.log(df['High'] / df['______'])
features['parkinson'] = np.sqrt((log_hl ** 2).rolling(20).mean() / (4 * np.log(2)))
features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
return features.dropna()
def fit(self, df: pd.DataFrame):
"""Fit models for each horizon."""
features = self.create_features(df)
returns = df['Close'].pct_change()
for horizon in self.horizons:
# Create target: forward volatility
target = returns.rolling(horizon).std().shift(-horizon)
# Align and clean
aligned = pd.concat([features, target.rename('target')], axis=1).dropna()
X = aligned.drop('target', axis=1)
y = aligned['target']
# Scale and fit
self.scalers[horizon] = StandardScaler()
X_scaled = self.scalers[horizon].fit_transform(X)
self.models[horizon] = GradientBoostingRegressor(
n_estimators=100, max_depth=3, random_state=42
)
self.models[horizon].fit(X_scaled, y)
return self
def predict(self, df: pd.DataFrame) -> pd.DataFrame:
"""Predict volatility for all horizons."""
features = self.create_features(df)
predictions = pd.DataFrame(index=features.index)
for horizon in self.horizons:
X_scaled = self.scalers[horizon].transform(features)
predictions[f'vol_{horizon}d'] = self.models[horizon].predict(X_scaled)
return predictions
# Test
# forecaster = VolatilityForecaster([1, 5, 20])
# forecaster.fit(df)
Solution 8.3
class VolatilityForecaster:
def __init__(self, horizons: List[int] = [1, 5, 20]):
self.horizons = horizons
self.models = {}
self.scalers = {}
def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
features = pd.DataFrame(index=df.index)
returns = df['Close'].pct_change()
for window in [5, 10, 20, 60]:
features[f'vol_{window}d'] = returns.rolling(window).std()
log_hl = np.log(df['High'] / df['Low'])
features['parkinson'] = np.sqrt((log_hl ** 2).rolling(20).mean() / (4 * np.log(2)))
features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
return features.dropna()
def fit(self, df: pd.DataFrame):
features = self.create_features(df)
returns = df['Close'].pct_change()
for horizon in self.horizons:
target = returns.rolling(horizon).std().shift(-horizon)
aligned = pd.concat([features, target.rename('target')], axis=1).dropna()
X = aligned.drop('target', axis=1)
y = aligned['target']
self.scalers[horizon] = StandardScaler()
X_scaled = self.scalers[horizon].fit_transform(X)
self.models[horizon] = GradientBoostingRegressor(
n_estimators=100, max_depth=3, random_state=42
)
self.models[horizon].fit(X_scaled, y)
return self
def predict(self, df: pd.DataFrame) -> pd.DataFrame:
features = self.create_features(df)
predictions = pd.DataFrame(index=features.index)
for horizon in self.horizons:
X_scaled = self.scalers[horizon].transform(features)
predictions[f'vol_{horizon}d'] = self.models[horizon].predict(X_scaled)
return predictions
Section 4: Quantile Regression
Predict different percentiles of the return distribution for tail risk analysis.
# Quantile Regression Concepts
quantile_concepts = """
QUANTILE REGRESSION
===================
What is Quantile Regression?
----------------------------
- Predict specific percentiles instead of mean
- Estimate conditional distribution of returns
- Essential for tail risk (VaR, CVaR)
Common Quantiles:
-----------------
- q=0.01: 1% worst case (VaR 99%)
- q=0.05: 5% worst case (VaR 95%)
- q=0.50: Median (robust to outliers)
- q=0.95: 5% best case
- q=0.99: 1% best case
Loss Function:
--------------
L(y, ŷ, q) = max(q*(y-ŷ), (q-1)*(y-ŷ))
Pinball loss that asymmetrically penalizes
under and over predictions.
Applications:
-------------
- Value at Risk (VaR)
- Conditional VaR (Expected Shortfall)
- Prediction intervals
- Tail risk management
"""
print(quantile_concepts)
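The pinball loss above can be checked numerically; this sketch (the `pinball_loss` helper is my own, not part of the module code) shows how a low quantile penalizes over-prediction far more heavily than under-prediction:

```python
import numpy as np

def pinball_loss(y: np.ndarray, y_hat: float, q: float) -> float:
    """Pinball loss: mean of max(q*(y - y_hat), (q - 1)*(y - y_hat))."""
    diff = y - y_hat
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

y = np.array([0.01, -0.02, 0.005, -0.01])  # toy daily returns

# At q=0.05 the (q-1) side carries weight 0.95 vs 0.05 on the q side,
# so predicting too HIGH is far more costly than predicting too low
low_guess = pinball_loss(y, y_hat=-0.05, q=0.05)
high_guess = pinball_loss(y, y_hat=0.05, q=0.05)
print(f"under-prediction loss: {low_guess:.5f}")
print(f"over-prediction loss:  {high_guess:.5f}")
```

This asymmetry is what pushes the q=0.05 model's predictions down into the left tail, which is exactly the behavior a VaR estimate needs.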
# Quantile Regression with Gradient Boosting
quantiles = [0.05, 0.25, 0.50, 0.75, 0.95]
quantile_models = {}
for q in quantiles:
model = GradientBoostingRegressor(
loss='quantile',
alpha=q,
n_estimators=100,
max_depth=3,
random_state=42
)
model.fit(X_train_scaled, y_train)
quantile_models[q] = model
# Predict all quantiles
quantile_preds = pd.DataFrame(index=y_test.index)
for q, model in quantile_models.items():
quantile_preds[f'q{int(q*100):02d}'] = model.predict(X_test_scaled)
print("Quantile Predictions (first 5 rows):")
print(quantile_preds.head())
# Visualize quantile predictions
fig, axes = plt.subplots(2, 1, figsize=(14, 10))
# Time series with prediction intervals
sample_size = 50
sample_idx = range(len(quantile_preds) - sample_size, len(quantile_preds))
axes[0].fill_between(range(sample_size),
quantile_preds['q05'].iloc[sample_idx],
quantile_preds['q95'].iloc[sample_idx],
alpha=0.2, label='90% CI')
axes[0].fill_between(range(sample_size),
quantile_preds['q25'].iloc[sample_idx],
quantile_preds['q75'].iloc[sample_idx],
alpha=0.4, label='50% CI')
axes[0].plot(range(sample_size), quantile_preds['q50'].iloc[sample_idx],
'b-', label='Median')
axes[0].plot(range(sample_size), y_test.iloc[sample_idx].values,
'ro', markersize=4, label='Actual')
axes[0].set_xlabel('Day')
axes[0].set_ylabel('Return')
axes[0].set_title('Quantile Predictions with Confidence Intervals')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Distribution of predictions
for q in [0.05, 0.50, 0.95]:
axes[1].hist(quantile_models[q].predict(X_test_scaled),
bins=30, alpha=0.5, label=f'q{int(q*100)}')
axes[1].axvline(x=0, color='black', linestyle='--')
axes[1].set_xlabel('Predicted Return')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Quantile Predictions')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Exercise 8.4: VaR Predictor (Open-ended)
#
# Build a VaRPredictor class that:
# - Uses quantile regression to predict VaR at different confidence levels
# - Calculates Expected Shortfall (CVaR)
# - Provides backtesting for VaR violations
# - Visualizes VaR predictions vs actual returns
# - Reports coverage statistics
#
# Your implementation:
Solution 8.4
class VaRPredictor:
"""Value at Risk prediction using quantile regression."""
def __init__(self, confidence_levels: List[float] = [0.95, 0.99]):
self.confidence_levels = confidence_levels
self.models = {}
self.scaler = StandardScaler()
def fit(self, X: pd.DataFrame, y: pd.Series):
"""Fit quantile models for each confidence level."""
X_scaled = self.scaler.fit_transform(X)
for conf in self.confidence_levels:
alpha = 1 - conf # VaR quantile
self.models[conf] = GradientBoostingRegressor(
loss='quantile',
alpha=alpha,
n_estimators=100,
max_depth=3,
random_state=42
)
self.models[conf].fit(X_scaled, y)
return self
def predict_var(self, X: pd.DataFrame) -> pd.DataFrame:
"""Predict VaR for all confidence levels."""
X_scaled = self.scaler.transform(X)
var_preds = pd.DataFrame(index=X.index)
for conf in self.confidence_levels:
var_preds[f'VaR_{int(conf*100)}'] = -self.models[conf].predict(X_scaled)
return var_preds
def backtest(self, X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
"""Backtest VaR predictions."""
var_preds = self.predict_var(X)
results = []
for conf in self.confidence_levels:
var_col = f'VaR_{int(conf*100)}'
# violation: actual return falls below the predicted quantile (-VaR)
violations = (y.values < -var_preds[var_col].values).sum()
expected = (1 - conf) * len(y)
results.append({
'confidence': conf,
'violations': violations,
'expected': expected,
'violation_rate': violations / len(y),
'expected_rate': 1 - conf
})
return pd.DataFrame(results)
def calculate_cvar(self, X: pd.DataFrame, y: pd.Series,
confidence: float = 0.95) -> float:
"""Calculate Conditional VaR (Expected Shortfall) as a scalar."""
var_preds = self.predict_var(X)
var_col = f'VaR_{int(confidence*100)}'
# CVaR: average loss on days when the return breaches the VaR level
mask = y.values < -var_preds[var_col].values
if mask.sum() > 0:
cvar = -y[mask].mean()
else:
cvar = var_preds[var_col].mean()
return cvar
def plot_backtest(self, X: pd.DataFrame, y: pd.Series):
"""Visualize VaR backtest."""
var_preds = self.predict_var(X)
fig, axes = plt.subplots(len(self.confidence_levels), 1,
figsize=(14, 4*len(self.confidence_levels)))
if len(self.confidence_levels) == 1:
axes = [axes]
for ax, conf in zip(axes, self.confidence_levels):
var_col = f'VaR_{int(conf*100)}'
ax.plot(y.index, y.values, 'b-', alpha=0.5, label='Returns')
ax.plot(y.index, -var_preds[var_col].values, 'r-', label=var_col)
# Mark violations
violations = y.values < -var_preds[var_col].values
ax.scatter(y.index[violations], y.values[violations],
c='red', s=50, zorder=5, label='Violations')
ax.set_title(f'VaR {int(conf*100)}% Backtest')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
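One way to judge the backtest's coverage statistics is an exact binomial check on the violation count, in the spirit of Kupiec's proportion-of-failures test. A minimal sketch; `coverage_pvalue` is a hypothetical helper, not part of the class above:

```python
from math import comb

def coverage_pvalue(violations: int, n: int, p: float) -> float:
    """Exact two-sided binomial test: probability of an outcome
    as or less likely than the observed violation count, when the
    expected breach rate is p."""
    pmf = lambda k: comb(n, k) * p ** k * (1 - p) ** (n - k)
    observed = pmf(violations)
    return sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed)

# 250 trading days at 95% VaR => ~12.5 breaches expected
print(f"13 breaches: p = {coverage_pvalue(13, 250, 0.05):.3f}")  # consistent with correct coverage
print(f"30 breaches: p = {coverage_pvalue(30, 250, 0.05):.2e}")  # coverage clearly violated
```

A small p-value means the violation count is inconsistent with the stated confidence level, so the quantile model is mis-calibrated rather than merely unlucky.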
# Exercise 8.5: Multi-Horizon Return Predictor (Open-ended)
#
# Build a MultiHorizonPredictor class that:
# - Predicts returns at multiple horizons (1, 5, 10, 20 days)
# - Uses different models for each horizon
# - Provides uncertainty estimates
# - Evaluates prediction accuracy at each horizon
# - Generates a comprehensive prediction report
#
# Your implementation:
Solution 8.5
class MultiHorizonPredictor:
"""Predict returns at multiple horizons."""
def __init__(self, horizons: List[int] = [1, 5, 10, 20]):
self.horizons = horizons
self.models = {}
self.scalers = {}
self.feature_names = None
def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create features from price data."""
features = pd.DataFrame(index=df.index)
returns = df['Close'].pct_change()
features['volatility'] = returns.rolling(20).std()
for p in [5, 10, 20]:
features[f'momentum_{p}'] = df['Close'].pct_change(p)
for p in [5, 20, 50]:
ma = df['Close'].rolling(p).mean()
features[f'dist_ma{p}'] = (df['Close'] - ma) / ma
delta = df['Close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
features['rsi'] = 100 - (100 / (1 + gain / loss))
return features.dropna()
def fit(self, df: pd.DataFrame):
"""Fit models for each horizon."""
features = self.create_features(df)
self.feature_names = features.columns.tolist()
for horizon in self.horizons:
# Create target
target = df['Close'].pct_change(horizon).shift(-horizon)
aligned = pd.concat([features, target.rename('target')], axis=1).dropna()
X = aligned.drop('target', axis=1)
y = aligned['target']
self.scalers[horizon] = StandardScaler()
X_scaled = self.scalers[horizon].fit_transform(X)
# Use ensemble
self.models[horizon] = {
'point': GradientBoostingRegressor(
n_estimators=100, max_depth=3, random_state=42
),
'lower': GradientBoostingRegressor(
loss='quantile', alpha=0.1, n_estimators=100,
max_depth=3, random_state=42
),
'upper': GradientBoostingRegressor(
loss='quantile', alpha=0.9, n_estimators=100,
max_depth=3, random_state=42
)
}
for model in self.models[horizon].values():
model.fit(X_scaled, y)
return self
def predict(self, df: pd.DataFrame) -> pd.DataFrame:
"""Predict returns with uncertainty."""
features = self.create_features(df)
predictions = pd.DataFrame(index=features.index)
for horizon in self.horizons:
X_scaled = self.scalers[horizon].transform(features)
predictions[f'h{horizon}_point'] = self.models[horizon]['point'].predict(X_scaled)
predictions[f'h{horizon}_lower'] = self.models[horizon]['lower'].predict(X_scaled)
predictions[f'h{horizon}_upper'] = self.models[horizon]['upper'].predict(X_scaled)
return predictions
def evaluate(self, df: pd.DataFrame, test_frac: float = 0.2) -> pd.DataFrame:
"""Evaluate predictions at each horizon."""
features = self.create_features(df)
        results = []
        for horizon in self.horizons:
            target = df['Close'].pct_change(horizon).shift(-horizon)
            aligned = pd.concat([features, target.rename('target')], axis=1).dropna()
            # Longer horizons lose more rows to the forward shift, so the
            # split index must be computed per horizon on the aligned frame.
            # Note: fit() saw the full history, so this is an in-sample check
            # rather than a true out-of-sample evaluation.
            split_idx = int(len(aligned) * (1 - test_frac))
            test_features = aligned.drop('target', axis=1).iloc[split_idx:]
            test_target = aligned['target'].iloc[split_idx:]
X_scaled = self.scalers[horizon].transform(test_features)
y_pred = self.models[horizon]['point'].predict(X_scaled)
results.append({
'horizon': horizon,
'rmse': np.sqrt(mean_squared_error(test_target, y_pred)),
'mae': mean_absolute_error(test_target, y_pred),
'r2': r2_score(test_target, y_pred)
})
return pd.DataFrame(results)
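The per-horizon target construction in fit() (`pct_change(horizon).shift(-horizon)`) is easy to verify on a toy series: row t ends up holding the return from t to t+horizon.

```python
import pandas as pd

# Toy close prices that double each day
close = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0])

h = 2
# pct_change(h) is the return over the *past* h rows; shifting by -h
# moves it back so row t holds the forward return from t to t+h.
target = close.pct_change(h).shift(-h)
print(target.tolist())  # [3.0, 3.0, 3.0, nan, nan]
```

The trailing NaNs are the last h rows, whose forward return is unknown; the dropna() in fit() discards exactly those.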
# Exercise 8.6: Complete Regression Evaluator (Open-ended)
#
# Build a RegressionEvaluator class that:
# - Compares multiple regression models
# - Uses walk-forward validation
# - Calculates regression metrics (MSE, MAE, R2)
# - Calculates direction accuracy (sign of prediction)
# - Computes information coefficient (IC)
# - Generates visualization of residuals and predictions
#
# Your implementation:
Solution 8.6
from scipy.stats import spearmanr
class RegressionEvaluator:
"""Comprehensive regression model evaluation."""
def __init__(self, models: Dict):
self.models = models
self.results = {}
self.predictions = {}
def evaluate(self, X_train, y_train, X_test, y_test):
"""Evaluate all models."""
for name, model in self.models.items():
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
self.predictions[name] = y_pred
# Regression metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Direction accuracy
dir_acc = ((y_test > 0) == (y_pred > 0)).mean()
# Information coefficient (rank correlation)
ic, _ = spearmanr(y_test, y_pred)
self.results[name] = {
'mse': mse,
'rmse': np.sqrt(mse),
'mae': mae,
'r2': r2,
'direction_accuracy': dir_acc,
'information_coefficient': ic
}
return self
def walk_forward_evaluate(self, X, y, train_size: int = 200,
step_size: int = 20):
"""Walk-forward evaluation."""
for name, model in self.models.items():
all_preds = []
all_actuals = []
start = 0
while start + train_size < len(X):
train_end = start + train_size
test_end = min(train_end + step_size, len(X))
X_train = X.iloc[start:train_end]
y_train_fold = y.iloc[start:train_end]
X_test = X.iloc[train_end:test_end]
y_test_fold = y.iloc[train_end:test_end]
model_clone = model.__class__(**model.get_params())
model_clone.fit(X_train, y_train_fold)
all_preds.extend(model_clone.predict(X_test))
all_actuals.extend(y_test_fold.values)
start += step_size
self.predictions[name] = np.array(all_preds)
y_test_wf = np.array(all_actuals)
mse = mean_squared_error(y_test_wf, all_preds)
ic, _ = spearmanr(y_test_wf, all_preds)
            self.results[name] = {
                'mse': mse,
                'rmse': np.sqrt(mse),
                'mae': mean_absolute_error(y_test_wf, all_preds),
                'r2': r2_score(y_test_wf, all_preds),
                'direction_accuracy': ((y_test_wf > 0) == (self.predictions[name] > 0)).mean(),
                # Same keys as evaluate() so get_comparison_table() lines up
                'information_coefficient': ic
            }
return self
def get_comparison_table(self) -> pd.DataFrame:
"""Get comparison DataFrame."""
return pd.DataFrame(self.results).T
def plot_results(self, y_test):
"""Plot predictions and residuals."""
n_models = len(self.models)
fig, axes = plt.subplots(n_models, 2, figsize=(14, 4*n_models))
if n_models == 1:
axes = axes.reshape(1, -1)
for i, (name, preds) in enumerate(self.predictions.items()):
# Scatter plot
axes[i, 0].scatter(y_test[:len(preds)], preds, alpha=0.5)
axes[i, 0].plot([y_test.min(), y_test.max()],
[y_test.min(), y_test.max()], 'r--')
axes[i, 0].set_xlabel('Actual')
axes[i, 0].set_ylabel('Predicted')
axes[i, 0].set_title(f'{name}: Actual vs Predicted')
axes[i, 0].grid(True, alpha=0.3)
# Residuals
residuals = y_test[:len(preds)].values - preds
axes[i, 1].hist(residuals, bins=30, alpha=0.7)
axes[i, 1].axvline(x=0, color='red', linestyle='--')
axes[i, 1].set_xlabel('Residual')
axes[i, 1].set_ylabel('Frequency')
axes[i, 1].set_title(f'{name}: Residual Distribution')
axes[i, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
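The information coefficient reported by the evaluator is just the Spearman rank correlation between predictions and realized returns; on a tiny example:

```python
import numpy as np
from scipy.stats import spearmanr

preds  = np.array([0.01, -0.02, 0.03, 0.00])
actual = np.array([0.02, -0.01, 0.05, -0.03])

# Rank correlation: rewards getting the *ordering* of returns right,
# which is what matters when signals are used to rank or size positions.
ic, _ = spearmanr(actual, preds)
print(f"IC = {ic:.2f}")  # 0.80
```

Here three of four rank pairs agree, giving an IC of 0.8; in live equity signals, sustained ICs of 0.05-0.10 are already considered strong.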
Module Project: Complete Return Prediction System
Build a comprehensive system for return and volatility prediction.
class ReturnPredictionSystem:
"""
Complete system for return and volatility prediction.
Features:
- Multiple model types (linear, tree, ensemble)
- Return and volatility forecasting
- Quantile predictions for risk management
- Walk-forward validation
"""
def __init__(self):
self.scaler = StandardScaler()
self.return_models = {}
self.vol_models = {}
self.quantile_models = {}
self.feature_names = None
def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create comprehensive feature set."""
features = pd.DataFrame(index=df.index)
returns = df['Close'].pct_change()
# Volatility features
for w in [5, 10, 20]:
features[f'vol_{w}d'] = returns.rolling(w).std()
# Momentum features
for p in [5, 10, 20]:
features[f'mom_{p}d'] = df['Close'].pct_change(p)
# MA distances
for p in [5, 20, 50]:
ma = df['Close'].rolling(p).mean()
features[f'dist_ma{p}'] = (df['Close'] - ma) / ma
# RSI
delta = df['Close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
features['rsi'] = 100 - (100 / (1 + gain / loss))
# Volume
features['vol_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
return features.dropna()
def fit(self, df: pd.DataFrame, test_frac: float = 0.2):
"""Fit all prediction models."""
features = self.create_features(df)
self.feature_names = features.columns.tolist()
returns = df['Close'].pct_change()
volatility = returns.rolling(20).std()
# Align targets
target_return = returns.shift(-1)
target_vol = volatility.shift(-1)
combined = pd.concat([
features,
target_return.rename('target_return'),
target_vol.rename('target_vol')
], axis=1).dropna()
X = combined[self.feature_names]
y_return = combined['target_return']
y_vol = combined['target_vol']
# Split
split_idx = int(len(X) * (1 - test_frac))
self.X_train = X[:split_idx]
self.X_test = X[split_idx:]
self.y_return_train = y_return[:split_idx]
self.y_return_test = y_return[split_idx:]
self.y_vol_train = y_vol[:split_idx]
self.y_vol_test = y_vol[split_idx:]
# Scale
self.X_train_scaled = self.scaler.fit_transform(self.X_train)
self.X_test_scaled = self.scaler.transform(self.X_test)
# Train return models
self.return_models = {
'Ridge': Ridge(alpha=1.0),
'RF': RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42),
'GB': GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
}
for name, model in self.return_models.items():
model.fit(self.X_train_scaled, self.y_return_train)
# Train volatility models
self.vol_models = {
'Ridge': Ridge(alpha=1.0),
'GB': GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
}
for name, model in self.vol_models.items():
model.fit(self.X_train_scaled, self.y_vol_train)
# Train quantile models
for q in [0.05, 0.50, 0.95]:
self.quantile_models[q] = GradientBoostingRegressor(
loss='quantile', alpha=q, n_estimators=100,
max_depth=3, random_state=42
)
self.quantile_models[q].fit(self.X_train_scaled, self.y_return_train)
return self
def evaluate(self) -> Dict:
"""Evaluate all models."""
results = {'return_models': {}, 'vol_models': {}}
for name, model in self.return_models.items():
y_pred = model.predict(self.X_test_scaled)
results['return_models'][name] = {
'rmse': np.sqrt(mean_squared_error(self.y_return_test, y_pred)),
'r2': r2_score(self.y_return_test, y_pred),
'dir_acc': ((self.y_return_test > 0) == (y_pred > 0)).mean()
}
for name, model in self.vol_models.items():
y_pred = model.predict(self.X_test_scaled)
results['vol_models'][name] = {
'rmse': np.sqrt(mean_squared_error(self.y_vol_test, y_pred)),
'r2': r2_score(self.y_vol_test, y_pred)
}
return results
def predict(self, df: pd.DataFrame) -> pd.DataFrame:
"""Generate all predictions."""
features = self.create_features(df)
X_scaled = self.scaler.transform(features)
predictions = pd.DataFrame(index=features.index)
# Ensemble return prediction
return_preds = np.zeros(len(features))
for model in self.return_models.values():
return_preds += model.predict(X_scaled) / len(self.return_models)
predictions['return_pred'] = return_preds
# Volatility prediction
predictions['vol_pred'] = self.vol_models['GB'].predict(X_scaled)
# Quantile predictions
for q, model in self.quantile_models.items():
predictions[f'q{int(q*100):02d}'] = model.predict(X_scaled)
return predictions
def plot_results(self):
"""Visualize results."""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Return predictions
best_model = self.return_models['GB']
y_pred = best_model.predict(self.X_test_scaled)
axes[0, 0].scatter(self.y_return_test, y_pred, alpha=0.5)
axes[0, 0].plot([self.y_return_test.min(), self.y_return_test.max()],
[self.y_return_test.min(), self.y_return_test.max()], 'r--')
axes[0, 0].set_xlabel('Actual Return')
axes[0, 0].set_ylabel('Predicted Return')
axes[0, 0].set_title('Return Prediction')
axes[0, 0].grid(True, alpha=0.3)
# Volatility predictions
vol_pred = self.vol_models['GB'].predict(self.X_test_scaled)
axes[0, 1].scatter(self.y_vol_test, vol_pred, alpha=0.5)
axes[0, 1].plot([self.y_vol_test.min(), self.y_vol_test.max()],
[self.y_vol_test.min(), self.y_vol_test.max()], 'r--')
axes[0, 1].set_xlabel('Actual Volatility')
axes[0, 1].set_ylabel('Predicted Volatility')
axes[0, 1].set_title('Volatility Prediction')
axes[0, 1].grid(True, alpha=0.3)
# Quantile predictions
q05 = self.quantile_models[0.05].predict(self.X_test_scaled)
q95 = self.quantile_models[0.95].predict(self.X_test_scaled)
axes[1, 0].fill_between(range(len(q05)), q05, q95, alpha=0.3, label='90% CI')
axes[1, 0].plot(self.y_return_test.values, 'b-', alpha=0.7, label='Actual')
axes[1, 0].set_xlabel('Day')
axes[1, 0].set_ylabel('Return')
axes[1, 0].set_title('Quantile Predictions')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Model comparison
results = self.evaluate()
models = list(results['return_models'].keys())
r2_scores = [results['return_models'][m]['r2'] for m in models]
dir_accs = [results['return_models'][m]['dir_acc'] for m in models]
x = np.arange(len(models))
width = 0.35
axes[1, 1].bar(x - width/2, r2_scores, width, label='R²', alpha=0.8)
axes[1, 1].bar(x + width/2, dir_accs, width, label='Dir Acc', alpha=0.8)
axes[1, 1].set_xticks(x)
axes[1, 1].set_xticklabels(models)
axes[1, 1].set_title('Model Comparison')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Test the complete system
# Get data
ticker = yf.Ticker("SPY")
data = ticker.history(period="2y")
# Create and fit system
system = ReturnPredictionSystem()
system.fit(data)
# Evaluate
results = system.evaluate()
print("Return Prediction Results:")
print(pd.DataFrame(results['return_models']).T)
print("\nVolatility Prediction Results:")
print(pd.DataFrame(results['vol_models']).T)
# Visualize results
system.plot_results()
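The quantile models in the system minimize the pinball (quantile) loss; a numpy-only sketch shows why a high alpha pushes predictions upward (the function name here is illustrative):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Asymmetric loss minimized, in expectation, at the q-th quantile."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y = np.array([0.0, 10.0])
# At q=0.9, under-prediction costs 9x more than over-prediction,
# so a high prediction (9) scores better than the midpoint (5).
print(pinball_loss(y, 5.0, 0.9))  # 2.5
print(pinball_loss(y, 9.0, 0.9))  # 0.9
```

This is the same loss GradientBoostingRegressor optimizes when `loss='quantile'` with the given `alpha`, which is how the q05/q95 columns form a prediction interval.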
Key Takeaways
- Regularized linear models (Ridge, Lasso, ElasticNet) are simple baselines that often perform well
- Tree-based regressors (Random Forest, Gradient Boosting) capture non-linear patterns
- Volatility is more predictable than returns because volatility clusters in time
- Quantile regression provides uncertainty estimates and is essential for risk management
- A low R² is normal for return prediction; even 1-5% can be profitable
- For trading, direction accuracy often matters more than the exact return magnitude
- Walk-forward validation limits overfitting and simulates real trading conditions
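The takeaway that volatility is more predictable than returns can be illustrated with a two-regime simulation: absolute returns are strongly autocorrelated while raw returns are not (regime lengths and volatilities below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two volatility regimes: calm first half, turbulent second half
vol = np.where(np.arange(1000) < 500, 0.01, 0.03)
returns = rng.normal(0.0, vol)

def autocorr(x, lag=1):
    """Lag-k sample autocorrelation."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

print(f"lag-1 autocorr of returns:   {autocorr(returns):.3f}")          # near 0
print(f"lag-1 autocorr of |returns|: {autocorr(np.abs(returns)):.3f}")  # clearly positive
```

Persistence in |returns| (or squared returns) is exactly the clustering that GARCH-style models and the rolling-volatility features in this module exploit.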
Next: Module 9 - Sentiment Analysis (Text processing, sentiment scoring, news signals)
Module 9: Sentiment Analysis
Part 3: Advanced Techniques
| Duration | Exercises | Prerequisites |
|---|---|---|
| ~2.5 hours | 6 | Modules 1-8 |
Learning Objectives
By the end of this module, you will be able to:
- Process and clean financial text data
- Apply sentiment scoring techniques to news and social media
- Use pre-trained models for financial sentiment
- Combine sentiment signals with price data
- Evaluate sentiment-based trading strategies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import re
from collections import Counter
import warnings
warnings.filterwarnings('ignore')
# NLP libraries
try:
from textblob import TextBlob
HAS_TEXTBLOB = True
except ImportError:
HAS_TEXTBLOB = False
print("TextBlob not installed. Install with: pip install textblob")
try:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
HAS_NLTK = True
except ImportError:
HAS_NLTK = False
print("NLTK not installed. Install with: pip install nltk")
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score, classification_report
import yfinance as yf
print("Module 9: Sentiment Analysis")
print("=" * 40)
Section 1: Text Processing Fundamentals
Before analyzing sentiment, we need to clean and preprocess text data.
# Sentiment Analysis Concepts
sentiment_concepts = """
SENTIMENT ANALYSIS FOR FINANCE
==============================
Why Sentiment Matters:
----------------------
- Market moves on news and perception
- Sentiment can lead price movements
- Social media provides real-time signals
- News impacts trading volumes
Data Sources:
-------------
1. News Articles
- Financial news (Bloomberg, Reuters)
- Press releases
- Analyst reports
2. Social Media
- Twitter/X (high frequency)
- Reddit (r/wallstreetbets)
- StockTwits
3. Company Filings
- 10-K, 10-Q reports
- Earnings call transcripts
- Conference calls
Sentiment Scoring Methods:
--------------------------
1. Lexicon-Based
- Dictionary of positive/negative words
- VADER, Loughran-McDonald
- Fast but may miss context
2. Machine Learning
- Train classifier on labeled data
- Can capture nuance
- Needs training data
3. Deep Learning
- Transformers (BERT, FinBERT)
- State-of-the-art accuracy
- Computationally expensive
"""
print(sentiment_concepts)
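The "fast but may miss context" caveat for lexicon methods is easy to demonstrate: a bare word-count scorer cannot see negation (the word sets below are illustrative, not a real lexicon):

```python
# Minimal lexicon scorer: count positive minus negative words
pos_words = {"beat", "surge", "strong", "record"}
neg_words = {"miss", "drop", "weak", "concern"}

def naive_score(text):
    words = text.lower().split()
    return sum(w in pos_words for w in words) - sum(w in neg_words for w in words)

print(naive_score("earnings beat expectations"))       # 1
print(naive_score("earnings did not beat estimates"))  # 1  <- negation ignored
```

Handling negation and intensifiers is precisely what the FinancialSentimentLexicon class later in this module adds on top of raw counting.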
# Sample financial news headlines
sample_headlines = [
"Apple reports record quarterly earnings, beats analyst expectations",
"Tech stocks tumble amid interest rate concerns",
"Federal Reserve signals potential rate cuts in 2024",
"Tesla faces production challenges, stock drops 5%",
"Microsoft's AI investments show strong returns",
"Oil prices surge on Middle East tensions",
"Retail sales disappoint, raising recession fears",
"Goldman Sachs upgrades Amazon to buy rating",
"Cryptocurrency market sees massive selloff",
"Strong jobs report eases inflation concerns",
"Boeing faces new safety investigation",
"Nvidia stock hits all-time high on AI demand",
"Bank earnings mixed amid economic uncertainty",
"Housing market shows signs of cooling",
"Disney streaming losses narrow, shares rally"
]
print(f"Sample Headlines ({len(sample_headlines)}):")
for i, headline in enumerate(sample_headlines[:5], 1):
print(f" {i}. {headline}")
# Text preprocessing functions
class TextPreprocessor:
"""Clean and preprocess financial text."""
def __init__(self):
# Common financial abbreviations to expand
self.abbreviations = {
'Q1': 'first quarter',
'Q2': 'second quarter',
'Q3': 'third quarter',
'Q4': 'fourth quarter',
'CEO': 'chief executive officer',
'CFO': 'chief financial officer',
'IPO': 'initial public offering',
'EPS': 'earnings per share',
'M&A': 'mergers and acquisitions',
'YoY': 'year over year',
'QoQ': 'quarter over quarter'
}
# Stopwords (common words to remove)
self.stopwords = set(['the', 'a', 'an', 'is', 'are', 'was', 'were',
'be', 'been', 'being', 'have', 'has', 'had',
'do', 'does', 'did', 'will', 'would', 'could',
'should', 'may', 'might', 'must', 'shall',
'and', 'or', 'but', 'if', 'then', 'else',
'when', 'where', 'why', 'how', 'what', 'which',
'who', 'whom', 'this', 'that', 'these', 'those',
'to', 'of', 'in', 'for', 'on', 'with', 'at',
'by', 'from', 'as', 'into', 'through', 'during'])
def clean(self, text: str) -> str:
"""Basic text cleaning."""
# Lowercase
text = text.lower()
# Remove URLs
text = re.sub(r'http\S+|www\S+|https\S+', '', text)
# Remove mentions and hashtags (for social media)
text = re.sub(r'@\w+|#\w+', '', text)
# Remove special characters but keep important punctuation
text = re.sub(r'[^a-zA-Z0-9\s\.\!\?\%\$]', '', text)
# Remove extra whitespace
text = ' '.join(text.split())
return text
def expand_abbreviations(self, text: str) -> str:
"""Expand financial abbreviations."""
for abbr, expansion in self.abbreviations.items():
text = re.sub(r'\b' + abbr + r'\b', expansion, text, flags=re.IGNORECASE)
return text
def remove_stopwords(self, text: str) -> str:
"""Remove common stopwords."""
words = text.split()
words = [w for w in words if w.lower() not in self.stopwords]
return ' '.join(words)
def process(self, text: str, remove_stops: bool = False) -> str:
"""Full preprocessing pipeline."""
text = self.clean(text)
text = self.expand_abbreviations(text)
if remove_stops:
text = self.remove_stopwords(text)
return text
# Test preprocessing
preprocessor = TextPreprocessor()
test_text = "Apple's Q4 EPS beats estimates! $AAPL @Bloomberg #stocks"
print(f"Original: {test_text}")
print(f"Cleaned: {preprocessor.process(test_text)}")
# Exercise 9.1: Financial Text Cleaner (Guided)
def clean_financial_text(text: str, extract_tickers: bool = True) -> Dict:
"""
Clean financial text and optionally extract stock tickers.
Returns:
Dictionary with cleaned text and extracted information
"""
result = {
'original': text,
'cleaned': '',
'tickers': [],
'numbers': [],
'percentages': []
}
# TODO: Extract stock tickers (pattern: $AAPL or just uppercase 2-5 letters)
if extract_tickers:
ticker_pattern = r'\$([A-Z]{1,5})|\b([A-Z]{2,5})\b'
matches = re.______(ticker_pattern, text)
result['tickers'] = list(set([m[0] or m[1] for m in matches if m[0] or m[1]]))
# TODO: Extract percentages
pct_pattern = r'([-+]?\d+\.?\d*)\s*%'
result['percentages'] = [float(p) for p in re.______(pct_pattern, text)]
# TODO: Extract numbers with $ sign
money_pattern = r'\$([\d,]+\.?\d*)'
result['numbers'] = re.______(money_pattern, text)
# Clean text
cleaned = text.lower()
cleaned = re.sub(r'[^a-zA-Z\s]', ' ', cleaned)
cleaned = ' '.join(cleaned.split())
result['cleaned'] = cleaned
return result
# Test
# result = clean_financial_text("$AAPL jumps 5.2% after beating Q4 estimates by $0.15")
Solution 9.1
def clean_financial_text(text: str, extract_tickers: bool = True) -> Dict:
"""
Clean financial text and optionally extract stock tickers.
"""
result = {
'original': text,
'cleaned': '',
'tickers': [],
'numbers': [],
'percentages': []
}
if extract_tickers:
ticker_pattern = r'\$([A-Z]{1,5})|\b([A-Z]{2,5})\b'
matches = re.findall(ticker_pattern, text)
result['tickers'] = list(set([m[0] or m[1] for m in matches if m[0] or m[1]]))
pct_pattern = r'([-+]?\d+\.?\d*)\s*%'
result['percentages'] = [float(p) for p in re.findall(pct_pattern, text)]
money_pattern = r'\$([\d,]+\.?\d*)'
result['numbers'] = re.findall(money_pattern, text)
cleaned = text.lower()
cleaned = re.sub(r'[^a-zA-Z\s]', ' ', cleaned)
cleaned = ' '.join(cleaned.split())
result['cleaned'] = cleaned
return result
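One detail the solution relies on: when a pattern contains multiple capture groups, re.findall returns a tuple per match with one slot per group, hence the `m[0] or m[1]` idiom. The same demo also exposes the pattern's known weakness of catching ordinary acronyms:

```python
import re

ticker_pattern = r'\$([A-Z]{1,5})|\b([A-Z]{2,5})\b'
matches = re.findall(ticker_pattern, "$AAPL beats; CEO cites AI demand")
print(matches)  # [('AAPL', ''), ('', 'CEO'), ('', 'AI')]

# Collapse the two-group tuples into plain strings
tickers = {m[0] or m[1] for m in matches}
print(tickers)  # CEO and AI are false positives
```

A production extractor would validate candidates from the bare-uppercase branch against a known symbol list before treating them as tickers.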
Section 2: Lexicon-Based Sentiment
Using predefined sentiment dictionaries to score text.
# VADER Sentiment Analysis
if HAS_NLTK:
sia = SentimentIntensityAnalyzer()
print("VADER Sentiment Analysis:")
print("=" * 60)
for headline in sample_headlines[:5]:
scores = sia.polarity_scores(headline)
sentiment = 'Positive' if scores['compound'] > 0.05 else \
'Negative' if scores['compound'] < -0.05 else 'Neutral'
print(f"\n'{headline[:50]}...'")
print(f" Compound: {scores['compound']:.3f} ({sentiment})")
print(f" Pos: {scores['pos']:.3f}, Neg: {scores['neg']:.3f}, Neu: {scores['neu']:.3f}")
else:
print("NLTK not available")
# Custom Financial Sentiment Lexicon
class FinancialSentimentLexicon:
"""Custom sentiment scoring for financial text."""
def __init__(self):
# Financial-specific sentiment words
self.positive_words = {
# Strong positive
'surge', 'soar', 'rally', 'boom', 'breakthrough', 'record',
'beat', 'exceed', 'outperform', 'upgrade', 'bullish',
# Moderate positive
'gain', 'rise', 'grow', 'improve', 'strong', 'robust',
'optimistic', 'profitable', 'positive', 'success',
# Mild positive
'stable', 'steady', 'maintain', 'recovery', 'rebound'
}
self.negative_words = {
# Strong negative
'crash', 'plunge', 'collapse', 'crisis', 'disaster',
'bankrupt', 'default', 'miss', 'downgrade', 'bearish',
# Moderate negative
'fall', 'drop', 'decline', 'loss', 'weak', 'concern',
'fear', 'risk', 'warning', 'struggle',
# Mild negative
'uncertainty', 'volatility', 'challenge', 'pressure', 'disappoint'
}
# Intensifiers and negations
self.intensifiers = {'very', 'extremely', 'significantly', 'sharply', 'dramatically'}
self.negations = {'not', 'no', 'never', 'neither', "n't", 'without', 'lack'}
# Word weights
self.word_weights = {
# Strong words get higher weights
'surge': 2.0, 'crash': -2.0, 'record': 1.5, 'crisis': -1.5,
'beat': 1.2, 'miss': -1.2, 'upgrade': 1.5, 'downgrade': -1.5
}
def score(self, text: str) -> Dict:
"""Score sentiment of text."""
text_lower = text.lower()
words = text_lower.split()
positive_count = 0
negative_count = 0
weighted_score = 0
prev_word = ''
for word in words:
# Check for negation
negated = prev_word in self.negations
intensified = prev_word in self.intensifiers
multiplier = 1.5 if intensified else 1.0
if negated:
multiplier *= -1
if word in self.positive_words:
weight = self.word_weights.get(word, 1.0)
positive_count += 1
weighted_score += weight * multiplier
elif word in self.negative_words:
weight = self.word_weights.get(word, -1.0)
negative_count += 1
weighted_score += weight * multiplier
prev_word = word
total_words = len(words)
return {
'positive_count': positive_count,
'negative_count': negative_count,
'weighted_score': weighted_score,
'normalized_score': weighted_score / total_words if total_words > 0 else 0,
'sentiment': 'positive' if weighted_score > 0.5 else
'negative' if weighted_score < -0.5 else 'neutral'
}
# Test
fin_lexicon = FinancialSentimentLexicon()
print("Financial Sentiment Lexicon:")
print("=" * 60)
for headline in sample_headlines[:5]:
scores = fin_lexicon.score(headline)
print(f"\n'{headline[:50]}...'")
print(f" Score: {scores['weighted_score']:.2f} ({scores['sentiment']})")
print(f" Pos words: {scores['positive_count']}, Neg words: {scores['negative_count']}")
# Exercise 9.2: Sentiment Scorer (Guided)
class SentimentScorer:
"""
Combined sentiment scoring using multiple methods.
"""
def __init__(self):
self.vader = SentimentIntensityAnalyzer() if HAS_NLTK else None
self.fin_lexicon = FinancialSentimentLexicon()
def score_vader(self, text: str) -> float:
"""Get VADER compound score."""
if self.vader:
# TODO: Get VADER polarity scores and return compound
scores = self.vader.______(text)
return scores['______']
return 0.0
def score_financial(self, text: str) -> float:
"""Get financial lexicon score."""
# TODO: Get financial lexicon scores and return normalized score
scores = self.fin_lexicon.______(text)
return scores['______']
def score_combined(self, text: str, vader_weight: float = 0.5) -> Dict:
"""Combine VADER and financial lexicon scores."""
vader_score = self.score_vader(text)
fin_score = self.score_financial(text)
# Normalize financial score to [-1, 1] range
fin_normalized = np.clip(fin_score / 2, -1, 1)
# Combined score
combined = vader_weight * vader_score + (1 - vader_weight) * fin_normalized
return {
'vader': vader_score,
'financial': fin_score,
'combined': combined,
'sentiment': 'positive' if combined > 0.1 else 'negative' if combined < -0.1 else 'neutral'
}
# Test
# scorer = SentimentScorer()
# result = scorer.score_combined("Apple reports record earnings")
Solution 9.2
class SentimentScorer:
def __init__(self):
self.vader = SentimentIntensityAnalyzer() if HAS_NLTK else None
self.fin_lexicon = FinancialSentimentLexicon()
def score_vader(self, text: str) -> float:
if self.vader:
scores = self.vader.polarity_scores(text)
return scores['compound']
return 0.0
def score_financial(self, text: str) -> float:
scores = self.fin_lexicon.score(text)
return scores['normalized_score']
def score_combined(self, text: str, vader_weight: float = 0.5) -> Dict:
vader_score = self.score_vader(text)
fin_score = self.score_financial(text)
fin_normalized = np.clip(fin_score / 2, -1, 1)
combined = vader_weight * vader_score + (1 - vader_weight) * fin_normalized
return {
'vader': vader_score,
'financial': fin_score,
'combined': combined,
'sentiment': 'positive' if combined > 0.1 else 'negative' if combined < -0.1 else 'neutral'
}
Section 3: News Sentiment Features
Creating tradeable features from news sentiment.
# Simulate news data with timestamps
def generate_simulated_news(n_days: int = 252) -> pd.DataFrame:
"""Generate simulated news data for demonstration."""
np.random.seed(42)
dates = pd.date_range(end=pd.Timestamp.today(), periods=n_days, freq='B')
# Templates
positive_templates = [
"Stock surges on strong earnings report",
"Analysts upgrade rating to buy",
"Company announces record quarterly revenue",
"Shares rally after positive guidance",
"Investor optimism grows on deal news"
]
negative_templates = [
"Stock drops on disappointing results",
"Analysts downgrade amid concerns",
"Shares tumble on weak guidance",
"Company faces regulatory challenges",
"Investors worry about debt levels"
]
neutral_templates = [
"Company reports inline with expectations",
"Stock trades sideways on mixed signals",
"Analysts maintain hold rating",
"Market awaits upcoming earnings release",
"Trading volume remains steady"
]
news_data = []
for date in dates:
# Generate 1-5 news items per day
n_news = np.random.randint(1, 6)
for _ in range(n_news):
# Randomly select sentiment
sentiment_type = np.random.choice(['positive', 'negative', 'neutral'], p=[0.35, 0.35, 0.3])
if sentiment_type == 'positive':
headline = np.random.choice(positive_templates)
elif sentiment_type == 'negative':
headline = np.random.choice(negative_templates)
else:
headline = np.random.choice(neutral_templates)
news_data.append({
'date': date,
'headline': headline,
'true_sentiment': sentiment_type
})
return pd.DataFrame(news_data)
# Generate news
news_df = generate_simulated_news()
print(f"Generated {len(news_df)} news items over {news_df['date'].nunique()} days")
print(f"\nSample:")
print(news_df.head(10).to_string(index=False))
# Create sentiment features from news
def create_sentiment_features(news_df: pd.DataFrame) -> pd.DataFrame:
"""Aggregate news sentiment into daily features."""
    scorer = SentimentScorer()
    news_df = news_df.copy()  # avoid mutating the caller's DataFrame
    # Score each headline
    news_df['sentiment_score'] = news_df['headline'].apply(
        lambda x: scorer.score_combined(x)['combined']
    )
# Aggregate by date
daily_features = news_df.groupby('date').agg({
'sentiment_score': ['mean', 'std', 'min', 'max', 'count'],
'headline': 'count'
}).reset_index()
# Flatten column names
daily_features.columns = [
'date', 'sentiment_mean', 'sentiment_std', 'sentiment_min',
'sentiment_max', 'sentiment_count', 'news_count'
]
# Calculate additional features
daily_features['sentiment_range'] = daily_features['sentiment_max'] - daily_features['sentiment_min']
daily_features['sentiment_skew'] = daily_features['sentiment_mean'] - \
(daily_features['sentiment_max'] + daily_features['sentiment_min']) / 2
# Rolling features
daily_features = daily_features.set_index('date').sort_index()
daily_features['sentiment_ma3'] = daily_features['sentiment_mean'].rolling(3).mean()
daily_features['sentiment_ma7'] = daily_features['sentiment_mean'].rolling(7).mean()
daily_features['sentiment_momentum'] = daily_features['sentiment_mean'] - daily_features['sentiment_ma7']
return daily_features.dropna()
# Create features
sentiment_features = create_sentiment_features(news_df)
print(f"Daily sentiment features: {sentiment_features.shape}")
print(f"\nFeatures:")
print(sentiment_features.head())
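The multi-statistic groupby used above produces a MultiIndex on the columns, which is why the code flattens the column names immediately afterwards; in miniature:

```python
import pandas as pd

df = pd.DataFrame({'date': ['d1', 'd1', 'd2'],
                   'score': [0.2, 0.4, -0.1]})

daily = df.groupby('date').agg({'score': ['mean', 'count']})
# Columns are now a MultiIndex: [('score', 'mean'), ('score', 'count')]
daily.columns = ['score_mean', 'score_count']
print(daily)
```

Assigning a flat list of names only works because the order matches the agg spec exactly; reordering the aggregations without updating the names would silently mislabel the features.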
# Visualize sentiment over time
fig, axes = plt.subplots(2, 1, figsize=(14, 8))
# Sentiment mean
axes[0].plot(sentiment_features.index, sentiment_features['sentiment_mean'],
'b-', alpha=0.7, label='Daily Mean')
axes[0].plot(sentiment_features.index, sentiment_features['sentiment_ma7'],
'r-', linewidth=2, label='7-Day MA')
axes[0].axhline(y=0, color='gray', linestyle='--')
axes[0].fill_between(sentiment_features.index,
sentiment_features['sentiment_min'],
sentiment_features['sentiment_max'],
alpha=0.2, label='Min-Max Range')
axes[0].set_ylabel('Sentiment Score')
axes[0].set_title('Daily News Sentiment')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# News volume
axes[1].bar(sentiment_features.index, sentiment_features['news_count'],
alpha=0.7, color='steelblue')
axes[1].set_ylabel('News Count')
axes[1].set_title('Daily News Volume')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Exercise 9.3: Sentiment Feature Engineer (Guided)
def create_advanced_sentiment_features(news_df: pd.DataFrame,
lookback_days: List[int] = [3, 7, 14]) -> pd.DataFrame:
"""
Create advanced sentiment features with multiple lookback periods.
"""
scorer = SentimentScorer()
# Score headlines
news_df = news_df.copy()
news_df['score'] = news_df['headline'].apply(
lambda x: scorer.score_combined(x)['combined']
)
# TODO: Aggregate by date
daily = news_df.groupby('date').agg({
'score': ['mean', 'std', 'count']
})
daily.columns = ['sentiment', 'sentiment_std', 'news_count']
daily = daily.______()
# Fill missing dates
full_dates = pd.date_range(daily.index.min(), daily.index.max(), freq='B')
daily = daily.reindex(full_dates)
daily['sentiment'] = daily['sentiment'].fillna(0)
daily['news_count'] = daily['news_count'].fillna(0)
# TODO: Create rolling features for each lookback period
for days in lookback_days:
daily[f'sentiment_ma{days}'] = daily['sentiment'].______(days).______()
daily[f'sentiment_vol{days}'] = daily['sentiment'].______(days).______()
# Sentiment momentum
daily['sentiment_momentum'] = daily['sentiment'] - daily['sentiment_ma7']
# Sentiment acceleration
daily['sentiment_accel'] = daily['sentiment_momentum'].diff()
return daily.dropna()
# Test
# advanced_features = create_advanced_sentiment_features(news_df)
Solution 9.3
def create_advanced_sentiment_features(news_df: pd.DataFrame,
lookback_days: List[int] = [3, 7, 14]) -> pd.DataFrame:
scorer = SentimentScorer()
news_df = news_df.copy()
news_df['score'] = news_df['headline'].apply(
lambda x: scorer.score_combined(x)['combined']
)
daily = news_df.groupby('date').agg({
'score': ['mean', 'std', 'count']
})
daily.columns = ['sentiment', 'sentiment_std', 'news_count']
daily = daily.sort_index()
full_dates = pd.date_range(daily.index.min(), daily.index.max(), freq='B')
daily = daily.reindex(full_dates)
daily['sentiment'] = daily['sentiment'].fillna(0)
daily['news_count'] = daily['news_count'].fillna(0)
for days in lookback_days:
daily[f'sentiment_ma{days}'] = daily['sentiment'].rolling(days).mean()
daily[f'sentiment_vol{days}'] = daily['sentiment'].rolling(days).std()
daily['sentiment_momentum'] = daily['sentiment'] - daily['sentiment_ma7']
daily['sentiment_accel'] = daily['sentiment_momentum'].diff()
return daily.dropna()
Section 4: Sentiment Trading Signals
Combining sentiment with price data for trading signals.
# Combine sentiment with price data
def combine_sentiment_with_price(sentiment_df: pd.DataFrame,
symbol: str = "SPY") -> pd.DataFrame:
"""Combine sentiment features with price data."""
# Get price data
ticker = yf.Ticker(symbol)
price_df = ticker.history(period="1y")
# Calculate price features
price_df['returns'] = price_df['Close'].pct_change()
price_df['volatility'] = price_df['returns'].rolling(20).std()
for p in [5, 10, 20]:
price_df[f'momentum_{p}'] = price_df['Close'].pct_change(p)
# Target: next day direction
price_df['target'] = (price_df['returns'].shift(-1) > 0).astype(int)
# Merge with sentiment
    # Align both indexes as tz-naive before joining (tz_localize(None)
    # raises a TypeError on an index that is already tz-naive)
    price_df.index = price_df.index.tz_localize(None)
    sentiment_df = sentiment_df.copy()
    sentiment_df.index = pd.to_datetime(sentiment_df.index)
    if sentiment_df.index.tz is not None:
        sentiment_df.index = sentiment_df.index.tz_localize(None)
combined = price_df.join(sentiment_df, how='left')
# Fill missing sentiment with 0
sentiment_cols = sentiment_df.columns
combined[sentiment_cols] = combined[sentiment_cols].fillna(0)
return combined.dropna()
# Combine data
combined_df = combine_sentiment_with_price(sentiment_features)
print(f"Combined data: {combined_df.shape}")
print(f"\nColumns: {combined_df.columns.tolist()}")
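Index alignment is the usual failure point in this join: yfinance returns a tz-aware DatetimeIndex, while features built with pd.date_range are tz-naive. A minimal sketch of the normalization on synthetic frames (no network access; the column values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Simulated tz-aware "price" index (as yfinance returns) and a tz-naive feature index
prices = pd.DataFrame(
    {"Close": np.linspace(100, 104, 5)},
    index=pd.date_range("2024-01-01", periods=5, freq="B", tz="America/New_York"),
)
features = pd.DataFrame(
    {"sentiment": [0.1, -0.2, 0.3, 0.0, 0.2]},
    index=pd.date_range("2024-01-01", periods=5, freq="B"),
)

# Strip the timezone only when one is present; calling tz_localize(None)
# on an already-naive index raises a TypeError
if prices.index.tz is not None:
    prices.index = prices.index.tz_localize(None)

combined = prices.join(features, how="left")
print(combined["sentiment"].notna().all())  # → True: every price row found a match
```

Without the normalization, the join silently produces all-NaN sentiment columns (or raises), because a tz-aware and a tz-naive timestamp never compare equal.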
# Build sentiment-enhanced prediction model
def build_sentiment_model(combined_df: pd.DataFrame) -> Dict:
"""Build and evaluate sentiment-enhanced model."""
# Define features
price_features = ['volatility', 'momentum_5', 'momentum_10', 'momentum_20']
    sentiment_features_cols = ['sentiment', 'sentiment_ma7', 'sentiment_momentum']
# Available columns
price_available = [f for f in price_features if f in combined_df.columns]
sentiment_available = [f for f in sentiment_features_cols if f in combined_df.columns]
# Models
    results = {}
    y = combined_df['target']  # defined once so both models can use it
    # Chronological split (no shuffling for time series)
    split_idx = int(len(combined_df) * 0.8)
    scaler = StandardScaler()
    # Model 1: Price only
    if price_available:
        X_price = combined_df[price_available]
        X_train, X_test = X_price[:split_idx], X_price[split_idx:]
        y_train, y_test = y[:split_idx], y[split_idx:]
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train_scaled, y_train)
results['price_only'] = {
'accuracy': accuracy_score(y_test, model.predict(X_test_scaled)),
'features': price_available
}
# Model 2: Price + Sentiment
all_features = price_available + sentiment_available
if all_features:
X_all = combined_df[all_features]
X_train, X_test = X_all[:split_idx], X_all[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
scaler_all = StandardScaler()
X_train_scaled = scaler_all.fit_transform(X_train)
X_test_scaled = scaler_all.transform(X_test)
model_all = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model_all.fit(X_train_scaled, y_train)
results['price_sentiment'] = {
'accuracy': accuracy_score(y_test, model_all.predict(X_test_scaled)),
'features': all_features,
'feature_importance': dict(zip(all_features, model_all.feature_importances_))
}
return results
# Build models
model_results = build_sentiment_model(combined_df)
print("Model Comparison:")
print("=" * 40)
for name, result in model_results.items():
print(f"\n{name}:")
print(f" Accuracy: {result['accuracy']:.2%}")
print(f" Features: {result['features']}")
if 'feature_importance' in result:
print(f" Top Features:")
for feat, imp in sorted(result['feature_importance'].items(), key=lambda x: -x[1])[:3]:
print(f" {feat}: {imp:.4f}")
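The comparison above relies on a single chronological 80/20 split. A more robust check is walk-forward evaluation, where each fold trains only on data strictly before its test window. A sketch with sklearn's TimeSeriesSplit on synthetic features (the data and scores here are illustrative stand-ins, not results from the models above):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))            # stand-in for price + sentiment features
y = (rng.normal(size=200) > 0).astype(int)

tscv = TimeSeriesSplit(n_splits=5)
scores = []
for train_idx, test_idx in tscv.split(X):
    # Train always precedes test chronologically -- no shuffling
    assert train_idx.max() < test_idx.min()
    model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"Walk-forward accuracy: {np.mean(scores):.2%} (+/- {np.std(scores):.2%})")
```

Averaging over five expanding-window folds gives a less split-dependent estimate than the single holdout used above.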
# Exercise 9.4: Sentiment Trading System (Open-ended)
#
# Build a SentimentTradingSystem class that:
# - Processes news headlines to extract sentiment
# - Creates daily sentiment features
# - Combines with price data for signals
# - Generates buy/sell signals based on sentiment thresholds
# - Backtests the strategy and reports performance
#
# Your implementation:
Solution 9.4
class SentimentTradingSystem:
"""Trading system based on news sentiment."""
def __init__(self, buy_threshold: float = 0.2, sell_threshold: float = -0.2):
self.buy_threshold = buy_threshold
self.sell_threshold = sell_threshold
self.scorer = SentimentScorer()
self.model = None
def score_news(self, news_df: pd.DataFrame) -> pd.DataFrame:
"""Score all news headlines."""
news_df = news_df.copy()
news_df['sentiment'] = news_df['headline'].apply(
lambda x: self.scorer.score_combined(x)['combined']
)
return news_df
def aggregate_daily(self, news_df: pd.DataFrame) -> pd.DataFrame:
"""Aggregate to daily sentiment."""
daily = news_df.groupby('date').agg({
'sentiment': ['mean', 'std', 'count']
})
daily.columns = ['sentiment', 'sentiment_std', 'news_count']
daily['sentiment_ma5'] = daily['sentiment'].rolling(5).mean()
daily['sentiment_momentum'] = daily['sentiment'] - daily['sentiment_ma5']
return daily.dropna()
def generate_signals(self, sentiment_df: pd.DataFrame) -> pd.DataFrame:
"""Generate trading signals."""
signals = pd.DataFrame(index=sentiment_df.index)
signals['sentiment'] = sentiment_df['sentiment']
signals['signal'] = 0
# Buy signal
signals.loc[sentiment_df['sentiment'] > self.buy_threshold, 'signal'] = 1
# Sell signal
signals.loc[sentiment_df['sentiment'] < self.sell_threshold, 'signal'] = -1
signals['position'] = signals['signal'].replace(0, np.nan).ffill().fillna(0)
return signals
def backtest(self, signals: pd.DataFrame, prices: pd.DataFrame) -> pd.DataFrame:
"""Backtest the sentiment strategy."""
prices.index = prices.index.tz_localize(None)
aligned = signals.join(prices[['Close']], how='inner')
aligned['returns'] = aligned['Close'].pct_change()
aligned['strategy_returns'] = aligned['position'].shift(1) * aligned['returns']
aligned['cum_returns'] = (1 + aligned['returns']).cumprod()
aligned['cum_strategy'] = (1 + aligned['strategy_returns'].fillna(0)).cumprod()
return aligned
def evaluate(self, backtest_results: pd.DataFrame) -> Dict:
"""Evaluate strategy performance."""
strat_rets = backtest_results['strategy_returns'].dropna()
return {
'total_return': backtest_results['cum_strategy'].iloc[-1] - 1,
'buy_hold_return': backtest_results['cum_returns'].iloc[-1] - 1,
'sharpe': np.sqrt(252) * strat_rets.mean() / strat_rets.std(),
'win_rate': (strat_rets > 0).mean(),
'n_trades': (backtest_results['signal'] != 0).sum()
}
def plot_results(self, backtest_results: pd.DataFrame):
"""Visualize results."""
fig, axes = plt.subplots(2, 1, figsize=(14, 8))
axes[0].plot(backtest_results['cum_strategy'], label='Strategy')
axes[0].plot(backtest_results['cum_returns'], label='Buy & Hold')
axes[0].set_title('Sentiment Strategy Performance')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[1].plot(backtest_results['sentiment'], alpha=0.7)
axes[1].axhline(y=self.buy_threshold, color='green', linestyle='--')
axes[1].axhline(y=self.sell_threshold, color='red', linestyle='--')
axes[1].set_title('Sentiment with Thresholds')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
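Two details in this class deserve emphasis: `replace(0, np.nan).ffill()` holds the last nonzero signal as the standing position, and `shift(1)` ensures a signal generated on day t only earns day t+1's return, avoiding lookahead. A compact sketch on a hand-built series (the numbers are arbitrary):

```python
import pandas as pd
import numpy as np

sentiment = pd.Series([0.3, 0.1, -0.1, -0.3, 0.0, 0.25])
returns = pd.Series([0.01, -0.02, 0.005, -0.01, 0.02, 0.01])

signal = pd.Series(0, index=sentiment.index)
signal[sentiment > 0.2] = 1    # buy when clearly positive
signal[sentiment < -0.2] = -1  # sell when clearly negative

# Hold the last nonzero signal; stay flat (0) until the first signal fires
position = signal.replace(0, np.nan).ffill().fillna(0)
print(position.tolist())  # → [1.0, 1.0, 1.0, -1.0, -1.0, 1.0]

# Yesterday's position earns today's return -- no lookahead
strategy_returns = position.shift(1) * returns
```

Dropping the `shift(1)` would let the strategy trade on a signal computed from the same day's close, which inflates backtest results with information not available at execution time.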
# Exercise 9.5: News Impact Analyzer (Open-ended)
#
# Build a NewsImpactAnalyzer class that:
# - Measures price impact after news events
# - Categorizes news by sentiment intensity
# - Calculates average returns for each sentiment category
# - Identifies which types of news have the most impact
# - Provides statistical significance tests
#
# Your implementation:
Solution 9.5
from scipy import stats
class NewsImpactAnalyzer:
"""Analyze impact of news on prices."""
def __init__(self, impact_windows: List[int] = [1, 3, 5]):
self.impact_windows = impact_windows
self.scorer = SentimentScorer()
def score_and_categorize(self, news_df: pd.DataFrame) -> pd.DataFrame:
"""Score news and categorize by intensity."""
news_df = news_df.copy()
news_df['sentiment'] = news_df['headline'].apply(
lambda x: self.scorer.score_combined(x)['combined']
)
# Categorize
news_df['category'] = pd.cut(
news_df['sentiment'],
bins=[-np.inf, -0.3, -0.1, 0.1, 0.3, np.inf],
labels=['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive']
)
return news_df
def calculate_impact(self, news_df: pd.DataFrame,
price_df: pd.DataFrame) -> pd.DataFrame:
"""Calculate price impact for each news item."""
news_df = news_df.copy()
price_df.index = price_df.index.tz_localize(None)
for window in self.impact_windows:
impacts = []
for _, row in news_df.iterrows():
date = row['date']
if date in price_df.index:
try:
idx = price_df.index.get_loc(date)
if idx + window < len(price_df):
impact = (
price_df.iloc[idx + window]['Close'] /
price_df.iloc[idx]['Close'] - 1
)
else:
impact = np.nan
                    except KeyError:
impact = np.nan
else:
impact = np.nan
impacts.append(impact)
news_df[f'impact_{window}d'] = impacts
return news_df
def analyze_by_category(self, news_df: pd.DataFrame) -> pd.DataFrame:
"""Analyze impact by sentiment category."""
results = []
for category in news_df['category'].unique():
cat_df = news_df[news_df['category'] == category]
for window in self.impact_windows:
impact_col = f'impact_{window}d'
impacts = cat_df[impact_col].dropna()
if len(impacts) > 0:
# T-test against zero
t_stat, p_value = stats.ttest_1samp(impacts, 0)
results.append({
'category': category,
'window': window,
'mean_impact': impacts.mean(),
'std_impact': impacts.std(),
'n_samples': len(impacts),
't_stat': t_stat,
'p_value': p_value,
'significant': p_value < 0.05
})
return pd.DataFrame(results)
def plot_impact(self, analysis_df: pd.DataFrame):
"""Visualize impact by category."""
pivot = analysis_df.pivot(
index='category',
columns='window',
values='mean_impact'
)
plt.figure(figsize=(10, 6))
pivot.plot(kind='bar', ax=plt.gca())
plt.title('Average Price Impact by Sentiment Category')
plt.xlabel('Sentiment Category')
plt.ylabel('Mean Return')
plt.legend(title='Days After')
plt.xticks(rotation=45)
plt.axhline(y=0, color='black', linestyle='--')
plt.tight_layout()
plt.show()
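The analyzer's two statistical building blocks, pd.cut for binning scores into labeled categories and scipy's one-sample t-test against zero, can be sketched in isolation (the scores and impact values below are fabricated purely to show the mechanics):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Same bin edges and labels as score_and_categorize
scores = pd.Series([-0.5, -0.2, 0.05, 0.2, 0.6])
categories = pd.cut(
    scores,
    bins=[-np.inf, -0.3, -0.1, 0.1, 0.3, np.inf],
    labels=['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive'],
)
print(categories.tolist())
# → ['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive']

# One-sample t-test: are these post-news returns distinguishable from zero?
impacts = np.array([0.012, 0.008, 0.015, 0.010, 0.011, 0.009])
t_stat, p_value = stats.ttest_1samp(impacts, 0.0)
print(f"t={t_stat:.2f}, p={p_value:.4f}, significant={p_value < 0.05}")
```

Note that pd.cut's bins are right-closed by default, so a score of exactly -0.3 falls into 'Very Negative'.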
# Exercise 9.6: Complete Sentiment Pipeline (Open-ended)
#
# Build a SentimentPipeline class that:
# - Ingests raw news data
# - Cleans and preprocesses text
# - Scores sentiment using multiple methods
# - Creates tradeable features
# - Builds and evaluates ML models
# - Generates signals and backtests
# - Produces a comprehensive report
#
# Your implementation:
Solution 9.6
class SentimentPipeline:
"""End-to-end sentiment analysis pipeline."""
def __init__(self):
self.preprocessor = TextPreprocessor()
self.scorer = SentimentScorer()
self.model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
self.scaler = StandardScaler()
self.results = {}
def preprocess(self, news_df: pd.DataFrame) -> pd.DataFrame:
"""Clean and preprocess news."""
news_df = news_df.copy()
news_df['cleaned'] = news_df['headline'].apply(self.preprocessor.process)
return news_df
def score(self, news_df: pd.DataFrame) -> pd.DataFrame:
"""Score sentiment."""
news_df['sentiment'] = news_df['headline'].apply(
lambda x: self.scorer.score_combined(x)['combined']
)
return news_df
def create_features(self, news_df: pd.DataFrame) -> pd.DataFrame:
"""Create daily features."""
daily = news_df.groupby('date').agg({
'sentiment': ['mean', 'std', 'min', 'max', 'count']
})
daily.columns = ['sent_mean', 'sent_std', 'sent_min', 'sent_max', 'news_count']
# Rolling features
for w in [3, 7, 14]:
daily[f'sent_ma{w}'] = daily['sent_mean'].rolling(w).mean()
daily[f'sent_vol{w}'] = daily['sent_mean'].rolling(w).std()
daily['sent_momentum'] = daily['sent_mean'] - daily['sent_ma7']
return daily.dropna()
def combine_with_price(self, features: pd.DataFrame,
symbol: str = 'SPY') -> pd.DataFrame:
"""Combine with price data."""
ticker = yf.Ticker(symbol)
prices = ticker.history(period='1y')
prices['returns'] = prices['Close'].pct_change()
prices['volatility'] = prices['returns'].rolling(20).std()
prices['target'] = (prices['returns'].shift(-1) > 0).astype(int)
prices.index = prices.index.tz_localize(None)
combined = prices.join(features, how='left').dropna()
return combined
def train_model(self, combined: pd.DataFrame, test_frac: float = 0.2):
"""Train prediction model."""
feature_cols = ['volatility', 'sent_mean', 'sent_ma7', 'sent_momentum']
available = [c for c in feature_cols if c in combined.columns]
X = combined[available]
y = combined['target']
split_idx = int(len(X) * (1 - test_frac))
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
X_train_scaled = self.scaler.fit_transform(X_train)
X_test_scaled = self.scaler.transform(X_test)
self.model.fit(X_train_scaled, y_train)
self.results['train_accuracy'] = self.model.score(X_train_scaled, y_train)
self.results['test_accuracy'] = self.model.score(X_test_scaled, y_test)
self.results['feature_importance'] = dict(zip(available, self.model.feature_importances_))
return self
def run_pipeline(self, news_df: pd.DataFrame, symbol: str = 'SPY') -> Dict:
"""Run full pipeline."""
print("Preprocessing...")
news_df = self.preprocess(news_df)
print("Scoring sentiment...")
news_df = self.score(news_df)
print("Creating features...")
features = self.create_features(news_df)
print("Combining with prices...")
combined = self.combine_with_price(features, symbol)
print("Training model...")
self.train_model(combined)
print("\nPipeline Complete!")
return self.results
def generate_report(self) -> str:
"""Generate text report."""
report = f"""Sentiment Pipeline Report
========================
Model Performance:
Train Accuracy: {self.results.get('train_accuracy', 0):.2%}
Test Accuracy: {self.results.get('test_accuracy', 0):.2%}
Feature Importance:
"""
for feat, imp in sorted(
self.results.get('feature_importance', {}).items(),
key=lambda x: -x[1]
):
report += f" {feat}: {imp:.4f}\n"
return report
Module Project: News Sentiment Trading System
Build a complete system that combines news sentiment analysis with trading signals.
class NewsSentimentTradingSystem:
"""
Complete news sentiment trading system.
Features:
- Multi-method sentiment scoring
- Feature engineering for sentiment
- ML model for signal generation
- Backtesting and performance analysis
"""
def __init__(self):
self.preprocessor = TextPreprocessor()
self.scorer = SentimentScorer()
self.model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
self.scaler = StandardScaler()
def process_news(self, news_df: pd.DataFrame) -> pd.DataFrame:
"""Process raw news data."""
news_df = news_df.copy()
# Clean text
news_df['cleaned'] = news_df['headline'].apply(self.preprocessor.process)
# Score sentiment
news_df['sentiment'] = news_df['headline'].apply(
lambda x: self.scorer.score_combined(x)['combined']
)
return news_df
def create_daily_features(self, news_df: pd.DataFrame) -> pd.DataFrame:
"""Aggregate news to daily features."""
daily = news_df.groupby('date').agg({
'sentiment': ['mean', 'std', 'min', 'max', 'count']
})
daily.columns = ['sent_mean', 'sent_std', 'sent_min', 'sent_max', 'news_count']
daily = daily.sort_index()
# Rolling features
for window in [3, 7, 14]:
daily[f'sent_ma{window}'] = daily['sent_mean'].rolling(window).mean()
# Momentum and volatility
daily['sent_momentum'] = daily['sent_mean'] - daily['sent_ma7']
daily['sent_vol7'] = daily['sent_mean'].rolling(7).std()
return daily.dropna()
def prepare_training_data(self, sentiment_features: pd.DataFrame,
symbol: str = "SPY") -> pd.DataFrame:
"""Prepare training data with price and sentiment."""
# Get price data
ticker = yf.Ticker(symbol)
prices = ticker.history(period="1y")
# Price features
prices['returns'] = prices['Close'].pct_change()
prices['volatility'] = prices['returns'].rolling(20).std()
prices['momentum_5'] = prices['Close'].pct_change(5)
prices['momentum_20'] = prices['Close'].pct_change(20)
# Target
prices['target'] = (prices['returns'].shift(-1) > 0).astype(int)
# Merge
prices.index = prices.index.tz_localize(None)
combined = prices.join(sentiment_features, how='left')
# Fill missing sentiment
sent_cols = sentiment_features.columns
combined[sent_cols] = combined[sent_cols].fillna(0)
return combined.dropna()
def fit(self, combined_df: pd.DataFrame, test_frac: float = 0.2):
"""Train the trading model."""
# Features
feature_cols = ['volatility', 'momentum_5', 'momentum_20',
'sent_mean', 'sent_ma7', 'sent_momentum', 'sent_vol7']
available = [c for c in feature_cols if c in combined_df.columns]
X = combined_df[available]
y = combined_df['target']
# Split
split_idx = int(len(X) * (1 - test_frac))
self.X_train = X[:split_idx]
self.X_test = X[split_idx:]
self.y_train = y[:split_idx]
self.y_test = y[split_idx:]
self.test_returns = combined_df['returns'][split_idx:]
# Scale and train
X_train_scaled = self.scaler.fit_transform(self.X_train)
self.model.fit(X_train_scaled, self.y_train)
self.feature_names = available
return self
def evaluate(self) -> Dict:
"""Evaluate model performance."""
X_train_scaled = self.scaler.transform(self.X_train)
X_test_scaled = self.scaler.transform(self.X_test)
y_pred = self.model.predict(X_test_scaled)
# Classification metrics
train_acc = self.model.score(X_train_scaled, self.y_train)
test_acc = self.model.score(X_test_scaled, self.y_test)
# Financial metrics
pred_series = pd.Series(y_pred, index=self.y_test.index)
strategy_returns = pred_series.shift(1) * self.test_returns
strategy_returns = strategy_returns.dropna()
total_return = (1 + strategy_returns).cumprod().iloc[-1] - 1
bh_return = (1 + self.test_returns.loc[strategy_returns.index]).cumprod().iloc[-1] - 1
sharpe = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std()
return {
'train_accuracy': train_acc,
'test_accuracy': test_acc,
'total_return': total_return,
'buy_hold_return': bh_return,
'outperformance': total_return - bh_return,
'sharpe_ratio': sharpe,
'feature_importance': dict(zip(self.feature_names, self.model.feature_importances_))
}
def plot_results(self):
"""Visualize results."""
X_test_scaled = self.scaler.transform(self.X_test)
y_pred = self.model.predict(X_test_scaled)
pred_series = pd.Series(y_pred, index=self.y_test.index)
strategy_returns = pred_series.shift(1) * self.test_returns
cum_strategy = (1 + strategy_returns.fillna(0)).cumprod()
cum_bh = (1 + self.test_returns.fillna(0)).cumprod()
fig, axes = plt.subplots(2, 1, figsize=(14, 10))
# Cumulative returns
axes[0].plot(cum_strategy.index, cum_strategy, label='Strategy', linewidth=2)
axes[0].plot(cum_bh.index, cum_bh, label='Buy & Hold', linewidth=2, alpha=0.7)
axes[0].set_ylabel('Cumulative Return')
axes[0].set_title('Sentiment Trading Strategy Performance')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Feature importance
importance = pd.Series(
self.model.feature_importances_,
index=self.feature_names
).sort_values()
axes[1].barh(importance.index, importance.values, color='steelblue')
axes[1].set_xlabel('Feature Importance')
axes[1].set_title('Model Feature Importance')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Run the complete system
# Generate simulated news
news_data = generate_simulated_news(252)
# Create system
system = NewsSentimentTradingSystem()
# Process news
print("Processing news...")
processed_news = system.process_news(news_data)
# Create features
print("Creating features...")
sentiment_features = system.create_daily_features(processed_news)
# Prepare training data
print("Preparing training data...")
combined = system.prepare_training_data(sentiment_features)
# Train
print("Training model...")
system.fit(combined)
# Evaluate
results = system.evaluate()
print("\n" + "="*50)
print("SENTIMENT TRADING SYSTEM RESULTS")
print("="*50)
print(f"\nClassification Metrics:")
print(f" Train Accuracy: {results['train_accuracy']:.2%}")
print(f" Test Accuracy: {results['test_accuracy']:.2%}")
print(f"\nFinancial Metrics:")
print(f" Strategy Return: {results['total_return']:.2%}")
print(f" Buy & Hold: {results['buy_hold_return']:.2%}")
print(f" Outperformance: {results['outperformance']:.2%}")
print(f" Sharpe Ratio: {results['sharpe_ratio']:.2f}")
print(f"\nTop Features:")
for feat, imp in sorted(results['feature_importance'].items(), key=lambda x: -x[1])[:5]:
print(f" {feat}: {imp:.4f}")
# Visualize
system.plot_results()
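The √252 factor in evaluate() annualizes a daily Sharpe ratio: the mean return scales with the number of trading days (252), while the standard deviation scales with its square root. A minimal check on synthetic daily returns (the drift and volatility figures are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# ~0.05% daily drift, 1% daily vol: roughly 12.6% annual drift, 16% annual vol
daily_returns = rng.normal(loc=0.0005, scale=0.01, size=252)

daily_sharpe = daily_returns.mean() / daily_returns.std()
annual_sharpe = np.sqrt(252) * daily_sharpe  # mean scales by 252, std by sqrt(252)
print(f"annualized Sharpe: {annual_sharpe:.2f}")
```

The same scaling logic applies to other frequencies, e.g. √52 for weekly or √12 for monthly returns.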
Key Takeaways
- Text preprocessing is critical: clean, normalize, and extract relevant entities from financial text
- Lexicon-based methods (VADER, custom dictionaries) are fast but may miss context
- Financial-specific lexicons outperform general sentiment tools for market data
- Sentiment features should include means, volatility, momentum, and rolling statistics
- Combining sentiment with price features often improves prediction accuracy
- News impact analysis helps identify which sentiment signals are most predictive
- Real-time news sources (Twitter, news APIs) provide actionable signals but require careful latency management
Next: Module 10 - Alternative Data (Web scraping, social media, multi-source features)
Module 10: Alternative Data
Part 3: Advanced Techniques
| Duration | Exercises | Prerequisites |
|---|---|---|
| ~2.5 hours | 6 | Modules 1-9 |
Learning Objectives
By the end of this module, you will be able to:
- Understand alternative data sources for trading
- Collect data from web and API sources
- Process and clean social media data
- Combine multiple data sources into features
- Build multi-source prediction models
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
import json
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')
# Data collection
try:
import requests
HAS_REQUESTS = True
except ImportError:
HAS_REQUESTS = False
print("requests not installed. Install with: pip install requests")
try:
from bs4 import BeautifulSoup
HAS_BS4 = True
except ImportError:
HAS_BS4 = False
print("BeautifulSoup not installed. Install with: pip install beautifulsoup4")
# ML
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score, classification_report
import yfinance as yf
print("Module 10: Alternative Data")
print("=" * 40)
Section 1: Alternative Data Sources
Understanding the landscape of non-traditional data for trading.
# Alternative Data Overview
alt_data_overview = """
ALTERNATIVE DATA FOR TRADING
============================
What is Alternative Data?
-------------------------
Non-traditional data sources beyond price/volume that provide
insights into economic activity, company performance, or sentiment.
Categories:
-----------
1. SOCIAL & SENTIMENT
- Twitter/X posts and trends
- Reddit (r/wallstreetbets, r/investing)
- StockTwits messages
- News headlines and articles
- Analyst reports
2. WEB DATA
- Google Trends search volume
- Website traffic (SimilarWeb)
- Job postings (LinkedIn, Indeed)
- Product reviews and ratings
- Price comparison sites
3. TRANSACTION DATA
- Credit card transactions
- Point of sale data
- App usage metrics
- Email receipts
4. GEOSPATIAL DATA
- Satellite imagery
- GPS/location data
- Foot traffic counts
- Shipping/logistics tracking
5. GOVERNMENT & ECONOMIC
- SEC filings
- Patent applications
- Building permits
- Import/export data
Considerations:
---------------
- Data quality and consistency
- Latency and timeliness
- Cost of acquisition
- Legal/compliance issues
- Alpha decay (as data becomes common)
"""
print(alt_data_overview)
# Simulate alternative data sources
def generate_simulated_alt_data(n_days: int = 252, symbol: str = "AAPL") -> Dict[str, pd.DataFrame]:
"""Generate simulated alternative data for demonstration."""
np.random.seed(42)
dates = pd.date_range(end=pd.Timestamp.today(), periods=n_days, freq='B')
# 1. Social Media Metrics
social_data = pd.DataFrame({
'date': dates,
'twitter_mentions': np.random.poisson(500, n_days),
'twitter_sentiment': np.random.normal(0.1, 0.3, n_days),
'reddit_posts': np.random.poisson(50, n_days),
'reddit_comments': np.random.poisson(200, n_days),
'stocktwits_messages': np.random.poisson(100, n_days),
'stocktwits_bullish_pct': np.random.beta(6, 4, n_days) # Slightly bullish bias
}).set_index('date')
# 2. Web Traffic Data
base_traffic = 1000000 + np.cumsum(np.random.normal(0, 50000, n_days))
web_data = pd.DataFrame({
'date': dates,
'website_visits': np.maximum(base_traffic, 500000).astype(int),
'app_downloads': np.random.poisson(10000, n_days),
'google_trend_score': np.clip(np.random.normal(60, 15, n_days), 0, 100),
'product_reviews': np.random.poisson(500, n_days),
'avg_review_score': np.random.normal(4.2, 0.3, n_days).clip(1, 5)
}).set_index('date')
# 3. Job Posting Data
base_jobs = 200 + np.cumsum(np.random.normal(0, 5, n_days))
job_data = pd.DataFrame({
'date': dates,
'job_postings': np.maximum(base_jobs, 100).astype(int),
'engineering_jobs': np.random.poisson(50, n_days),
'sales_jobs': np.random.poisson(30, n_days),
'avg_salary_listed': np.random.normal(120000, 20000, n_days)
}).set_index('date')
# 4. Satellite/Foot Traffic Data
geo_data = pd.DataFrame({
'date': dates,
'store_foot_traffic': np.random.poisson(5000, n_days),
'parking_lot_fill': np.random.beta(5, 3, n_days),
'shipping_containers': np.random.poisson(1000, n_days)
}).set_index('date')
return {
'social': social_data,
'web': web_data,
'jobs': job_data,
'geo': geo_data
}
# Generate data
alt_data = generate_simulated_alt_data()
print("Generated Alternative Data:")
for source, df in alt_data.items():
print(f"\n{source.upper()}:")
print(f" Columns: {df.columns.tolist()}")
print(f" Shape: {df.shape}")
# Visualize alternative data
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Social media
ax1 = axes[0, 0]
ax1.plot(alt_data['social'].index, alt_data['social']['twitter_mentions'],
label='Twitter Mentions', alpha=0.7)
ax1.set_ylabel('Mentions')
ax1.set_title('Social Media Activity')
ax1_twin = ax1.twinx()
ax1_twin.plot(alt_data['social'].index, alt_data['social']['twitter_sentiment'],
'r-', label='Sentiment', alpha=0.7)
ax1_twin.set_ylabel('Sentiment', color='r')
ax1.legend(loc='upper left')
ax1.grid(True, alpha=0.3)
# Web traffic
axes[0, 1].plot(alt_data['web'].index, alt_data['web']['website_visits'],
label='Website Visits')
axes[0, 1].plot(alt_data['web'].index, alt_data['web']['google_trend_score'] * 20000,
label='Google Trends (scaled)', alpha=0.7)
axes[0, 1].set_ylabel('Visits')
axes[0, 1].set_title('Web Traffic Metrics')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Job postings
axes[1, 0].fill_between(alt_data['jobs'].index, alt_data['jobs']['job_postings'],
alpha=0.5, label='Total Jobs')
axes[1, 0].plot(alt_data['jobs'].index, alt_data['jobs']['engineering_jobs'] * 4,
label='Engineering (x4)', alpha=0.8)
axes[1, 0].set_ylabel('Job Postings')
axes[1, 0].set_title('Job Market Indicators')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Geospatial
axes[1, 1].plot(alt_data['geo'].index, alt_data['geo']['store_foot_traffic'],
label='Foot Traffic')
axes[1, 1].set_ylabel('Daily Visitors')
axes[1, 1].set_title('Physical Activity Indicators')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Exercise 10.1: Alt Data Feature Calculator (Guided)
def calculate_alt_data_features(alt_data: Dict[str, pd.DataFrame],
lookback_days: List[int] = [5, 20]) -> pd.DataFrame:
"""
Calculate features from alternative data sources.
Returns:
DataFrame with all calculated features
"""
features = pd.DataFrame()
# Social features
social = alt_data['social']
features['social_sentiment'] = social['twitter_sentiment']
features['social_volume'] = social['twitter_mentions'] + social['reddit_posts'] * 10
features['bullish_ratio'] = social['stocktwits_bullish_pct']
# TODO: Calculate rolling features for social
for days in lookback_days:
features[f'sentiment_ma{days}'] = social['twitter_sentiment'].______(days).______()
features[f'volume_ma{days}'] = features['social_volume'].______(days).______()
# Web features
web = alt_data['web']
features['web_traffic'] = web['website_visits']
features['google_trends'] = web['google_trend_score']
features['review_score'] = web['avg_review_score']
# TODO: Calculate traffic changes
features['traffic_change_5d'] = web['website_visits'].______(5)
features['traffic_change_20d'] = web['website_visits'].______(20)
# Job features
jobs = alt_data['jobs']
features['job_postings'] = jobs['job_postings']
features['job_growth'] = jobs['job_postings'].pct_change(20)
features['eng_to_sales'] = jobs['engineering_jobs'] / (jobs['sales_jobs'] + 1)
# Geo features
geo = alt_data['geo']
features['foot_traffic'] = geo['store_foot_traffic']
features['parking_fill'] = geo['parking_lot_fill']
return features.dropna()
# Test
# alt_features = calculate_alt_data_features(alt_data)
Solution 10.1
def calculate_alt_data_features(alt_data: Dict[str, pd.DataFrame],
lookback_days: List[int] = [5, 20]) -> pd.DataFrame:
features = pd.DataFrame()
social = alt_data['social']
features['social_sentiment'] = social['twitter_sentiment']
features['social_volume'] = social['twitter_mentions'] + social['reddit_posts'] * 10
features['bullish_ratio'] = social['stocktwits_bullish_pct']
for days in lookback_days:
features[f'sentiment_ma{days}'] = social['twitter_sentiment'].rolling(days).mean()
features[f'volume_ma{days}'] = features['social_volume'].rolling(days).mean()
web = alt_data['web']
features['web_traffic'] = web['website_visits']
features['google_trends'] = web['google_trend_score']
features['review_score'] = web['avg_review_score']
features['traffic_change_5d'] = web['website_visits'].pct_change(5)
features['traffic_change_20d'] = web['website_visits'].pct_change(20)
jobs = alt_data['jobs']
features['job_postings'] = jobs['job_postings']
features['job_growth'] = jobs['job_postings'].pct_change(20)
features['eng_to_sales'] = jobs['engineering_jobs'] / (jobs['sales_jobs'] + 1)
geo = alt_data['geo']
features['foot_traffic'] = geo['store_foot_traffic']
features['parking_fill'] = geo['parking_lot_fill']
return features.dropna()
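Two idioms in this solution are worth isolating: pct_change(n) measures the relative change over an n-row window, and the +1 in the engineering-to-sales ratio guards against division by zero. A tiny check (the series values are made up):

```python
import pandas as pd

visits = pd.Series([100, 110, 121, 133, 121, 110])
# pct_change(2) compares each value to the one 2 rows earlier;
# the first 2 entries are NaN because there is nothing to compare against
print(round(visits.pct_change(2).iloc[2], 4))  # → 0.21  (121/100 - 1)

eng = pd.Series([10, 0, 5])
sales = pd.Series([0, 4, 5])
ratio = eng / (sales + 1)  # +1 keeps the ratio finite when sales postings are zero
```

Without the +1, the first element of the ratio would be inf and could silently propagate through downstream scaling and model fitting.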
Section 2: Web Data Collection
Techniques for collecting data from web sources.
# Web Scraping Best Practices
web_scraping_guide = """
WEB DATA COLLECTION
===================
Data Collection Methods:
------------------------
1. APIs (Preferred)
- Official, structured access
- Rate limits and authentication
- Examples: Twitter API, Reddit API, Alpha Vantage
2. Web Scraping
- Parse HTML content
- Tools: BeautifulSoup, Scrapy, Selenium
- Requires understanding of HTML structure
3. Data Vendors
- Pre-processed alternative data
- Examples: Quandl, Bloomberg, Refinitiv
- Higher cost, higher quality
Ethical Considerations:
-----------------------
- Respect robots.txt
- Rate limit your requests
- Don't overload servers
- Respect terms of service
- Handle personal data carefully
Technical Best Practices:
-------------------------
- Use user-agent headers
- Implement exponential backoff
- Cache responses
- Handle errors gracefully
- Log all requests
Common Data Sources:
--------------------
- SEC EDGAR (company filings)
- Google Trends
- GitHub activity
- Wikipedia pageviews
- Job boards (Indeed, LinkedIn)
"""
print(web_scraping_guide)
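The guide recommends exponential backoff; a dependency-free sketch of the retry loop follows. The flaky_fetch function is a stand-in of my own invention for a real HTTP call, and the delays are kept tiny so the example runs instantly:

```python
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=0.01):
    """Retry fetch(), doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt)  # 0.01, 0.02, 0.04, ...
            time.sleep(delay)

# Stand-in for an unreliable endpoint: fails twice, then succeeds
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporarily unavailable")
    return "payload"

print(fetch_with_backoff(flaky_fetch))  # → payload
```

In production you would use a real base delay (1s or more), add random jitter to avoid synchronized retries, and catch the specific exceptions your HTTP client raises (e.g. requests.exceptions.RequestException).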
# Web Data Collector Class
class WebDataCollector:
"""Collect data from web sources with rate limiting and caching."""
def __init__(self, rate_limit: float = 1.0):
"""
Args:
rate_limit: Minimum seconds between requests
"""
self.rate_limit = rate_limit
self.last_request = datetime.min
self.cache = {}
self.headers = {
'User-Agent': 'Mozilla/5.0 (Educational/Research Purpose)'
}
def _wait_for_rate_limit(self):
"""Ensure we don't exceed rate limit."""
elapsed = (datetime.now() - self.last_request).total_seconds()
if elapsed < self.rate_limit:
import time
time.sleep(self.rate_limit - elapsed)
self.last_request = datetime.now()
def fetch_url(self, url: str, use_cache: bool = True) -> Optional[str]:
"""Fetch content from URL."""
if not HAS_REQUESTS:
print("requests library not available")
return None
# Check cache
if use_cache and url in self.cache:
return self.cache[url]
self._wait_for_rate_limit()
try:
response = requests.get(url, headers=self.headers, timeout=10)
response.raise_for_status()
content = response.text
# Cache result
self.cache[url] = content
return content
except requests.exceptions.RequestException as e:
print(f"Error fetching {url}: {e}")
return None
def parse_html(self, html: str, selector: str) -> List[str]:
"""Parse HTML and extract text from elements."""
if not HAS_BS4 or not html:
return []
soup = BeautifulSoup(html, 'html.parser')
elements = soup.select(selector)
return [elem.get_text(strip=True) for elem in elements]
def fetch_json(self, url: str) -> Optional[Dict]:
"""Fetch and parse JSON from URL."""
content = self.fetch_url(url)
if content:
try:
return json.loads(content)
except json.JSONDecodeError:
print(f"Invalid JSON from {url}")
return None
# Example usage
collector = WebDataCollector(rate_limit=1.0)
print("WebDataCollector initialized")
print(f" Rate limit: {collector.rate_limit}s")
print(f" User-Agent: {collector.headers['User-Agent']}")
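Respecting robots.txt, from the ethics checklist above, can be automated with the standard library's `urllib.robotparser`. This sketch parses rules from an in-memory string rather than fetching a live file; the paths and domain are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check specific paths before fetching them
allowed_home = rp.can_fetch("*", "https://example.com/")
allowed_private = rp.can_fetch("*", "https://example.com/private/data.html")
```

A collector like `WebDataCollector` could call `can_fetch` before every request and skip disallowed URLs.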
# Simulated API data (for demonstration without actual API calls)
def simulate_google_trends_data(keyword: str, n_days: int = 90) -> pd.DataFrame:
"""Simulate Google Trends data."""
# Note: str hash() varies between runs unless PYTHONHASHSEED is fixed
np.random.seed(hash(keyword) % 2**32)
dates = pd.date_range(end=pd.Timestamp.today(), periods=n_days, freq='D')
# Generate trend with weekly seasonality and random noise
trend = np.random.normal(50, 10, n_days)
seasonality = 10 * np.sin(2 * np.pi * np.arange(n_days) / 7)
values = np.clip(trend + seasonality, 0, 100)
return pd.DataFrame({
'date': dates,
'keyword': keyword,
'interest': values.astype(int)
}).set_index('date')
def simulate_reddit_data(subreddit: str, n_days: int = 90) -> pd.DataFrame:
"""Simulate Reddit activity data."""
np.random.seed(hash(subreddit) % 2**32)
dates = pd.date_range(end=pd.Timestamp.today(), periods=n_days, freq='D')
return pd.DataFrame({
'date': dates,
'subreddit': subreddit,
'posts': np.random.poisson(50, n_days),
'comments': np.random.poisson(500, n_days),
'avg_score': np.random.exponential(100, n_days),
'sentiment': np.random.normal(0.1, 0.4, n_days)
}).set_index('date')
# Get simulated data
google_data = simulate_google_trends_data("Tesla stock")
reddit_data = simulate_reddit_data("wallstreetbets")
print("Simulated Google Trends:")
print(google_data.head())
print("\nSimulated Reddit Data:")
print(reddit_data.head())
# Exercise 10.2: Multi-Source Data Aggregator (Guided)
class MultiSourceAggregator:
"""
Aggregate data from multiple alternative sources.
"""
def __init__(self):
self.sources = {}
self.combined_data = None
def add_source(self, name: str, data: pd.DataFrame,
date_col: str = None):
"""Add a data source."""
df = data.copy()
# Ensure date index
if date_col and date_col in df.columns:
df = df.set_index(date_col)
# Prefix columns with source name
df.columns = [f'{name}_{col}' for col in df.columns]
self.sources[name] = df
return self
def combine(self, fill_method: str = 'ffill') -> pd.DataFrame:
"""Combine all sources into single DataFrame."""
if not self.sources:
return pd.DataFrame()
# TODO: Start with first source
combined = list(self.sources.______())[0].copy()
# TODO: Join remaining sources
for name, df in list(self.sources.items())[1:]:
combined = combined.______(df, how='outer')
# Fill missing values
if fill_method == 'ffill':
combined = combined.ffill()  # .fillna(method=...) is deprecated in pandas 2.x
elif fill_method == 'zero':
combined = combined.fillna(0)
self.combined_data = combined
return combined
def get_correlation_matrix(self) -> pd.DataFrame:
"""Calculate correlation between all features."""
if self.combined_data is None:
self.combine()
return self.combined_data.corr()
# Test
# aggregator = MultiSourceAggregator()
# aggregator.add_source('google', google_data)
# aggregator.add_source('reddit', reddit_data)
Solution 10.2
class MultiSourceAggregator:
def __init__(self):
self.sources = {}
self.combined_data = None
def add_source(self, name: str, data: pd.DataFrame,
date_col: str = None):
df = data.copy()
if date_col and date_col in df.columns:
df = df.set_index(date_col)
df.columns = [f'{name}_{col}' for col in df.columns]
self.sources[name] = df
return self
def combine(self, fill_method: str = 'ffill') -> pd.DataFrame:
if not self.sources:
return pd.DataFrame()
combined = list(self.sources.values())[0].copy()
for name, df in list(self.sources.items())[1:]:
combined = combined.join(df, how='outer')
if fill_method == 'ffill':
combined = combined.ffill()
elif fill_method == 'zero':
combined = combined.fillna(0)
self.combined_data = combined
return combined
def get_correlation_matrix(self) -> pd.DataFrame:
if self.combined_data is None:
self.combine()
return self.combined_data.corr()
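The outer join plus forward-fill in `combine` is the crux of the aggregator. A toy example with two hand-made frames on partially overlapping dates (the values are invented) shows how a gap in one source gets filled with its last known value:

```python
import pandas as pd

a = pd.DataFrame({'google_interest': [50.0, 60.0]},
                 index=pd.to_datetime(['2024-01-01', '2024-01-03']))
b = pd.DataFrame({'reddit_posts': [10, 20, 30]},
                 index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']))

# Outer join keeps the union of dates; ffill carries the last observation forward
combined = a.join(b, how='outer').ffill()
```

After the join, `google_interest` is missing on 2024-01-02 (Google data has no row there), and forward-fill repeats the prior value of 50.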
Section 3: Social Media Data
Processing and analyzing social media data for trading signals.
# Social Media Data Processing
class SocialMediaProcessor:
"""Process social media data for trading signals."""
def __init__(self):
self.ticker_patterns = {}
def extract_tickers(self, text: str) -> List[str]:
"""Extract stock tickers from text."""
import re
# Pattern: $AAPL or $aapl
cashtag_pattern = r'\$([A-Za-z]{1,5})'
tickers = re.findall(cashtag_pattern, text)
return [t.upper() for t in tickers]
def calculate_engagement_score(self, likes: int, comments: int,
shares: int, followers: int) -> float:
"""Calculate normalized engagement score."""
if followers == 0:
return 0
engagement = (likes + comments * 2 + shares * 3) / followers
return min(engagement * 100, 100) # Cap at 100
def analyze_post_timing(self, timestamps: pd.Series) -> Dict:
"""Analyze timing patterns in posts."""
timestamps = pd.to_datetime(timestamps)
return {
'posts_by_hour': timestamps.dt.hour.value_counts().to_dict(),
'posts_by_day': timestamps.dt.dayofweek.value_counts().to_dict(),
'peak_hour': timestamps.dt.hour.mode().iloc[0] if len(timestamps) > 0 else None,
'weekend_ratio': (timestamps.dt.dayofweek >= 5).mean()
}
def calculate_velocity(self, counts: pd.Series, window: int = 24) -> pd.Series:
"""Calculate rate of change in social activity."""
return counts.diff(window) / window
def detect_anomalies(self, values: pd.Series, threshold: float = 2.0) -> pd.Series:
"""Detect unusual spikes in activity."""
mean = values.rolling(20).mean()
std = values.rolling(20).std()
z_score = (values - mean) / std
return z_score.abs() > threshold
# Test
processor = SocialMediaProcessor()
sample_text = "$AAPL is looking bullish! Also watching $TSLA and $MSFT for breakouts."
tickers = processor.extract_tickers(sample_text)
print(f"Extracted tickers: {tickers}")
engagement = processor.calculate_engagement_score(likes=500, comments=50, shares=25, followers=10000)
print(f"Engagement score: {engagement:.2f}")
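The cashtag regex in `extract_tickers` is deliberately loose, so it also matches non-ticker strings like `$LOL` or `$USD`. A common refinement is to filter the extracted symbols against a whitelist of known tickers; the three-symbol set below is an illustrative stub — a real system would load a full exchange listing.

```python
import re

KNOWN_TICKERS = {'AAPL', 'TSLA', 'MSFT'}  # stub; use a real exchange listing in practice

def extract_valid_tickers(text):
    """Extract cashtags, then keep only symbols on the whitelist."""
    raw = [t.upper() for t in re.findall(r'\$([A-Za-z]{1,5})', text)]
    return [t for t in raw if t in KNOWN_TICKERS]

hits = extract_valid_tickers("$AAPL to the moon $LOL, also watching $tsla")
```

Here `$LOL` is extracted by the regex but discarded by the whitelist, leaving only real symbols.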
# Simulate social media posts
def simulate_social_posts(ticker: str, n_posts: int = 1000) -> pd.DataFrame:
"""Simulate social media posts about a ticker."""
np.random.seed(hash(ticker) % 2**32)
# Generate timestamps over 30 days
base_date = pd.Timestamp.today() - timedelta(days=30)
timestamps = pd.to_datetime(
base_date + pd.to_timedelta(np.random.uniform(0, 30*24*60, n_posts), unit='m')
)
# Simulate post metrics
posts = pd.DataFrame({
'timestamp': timestamps,
'ticker': ticker,
'likes': np.random.exponential(50, n_posts).astype(int),
'comments': np.random.exponential(10, n_posts).astype(int),
'shares': np.random.exponential(5, n_posts).astype(int),
'followers': np.random.exponential(5000, n_posts).astype(int),
'sentiment': np.random.normal(0.1, 0.5, n_posts)
}).sort_values('timestamp')
return posts
# Generate simulated posts
social_posts = simulate_social_posts("AAPL")
print(f"Simulated {len(social_posts)} social media posts")
print(social_posts.head())
# Aggregate social posts to daily features
def aggregate_social_to_daily(posts: pd.DataFrame) -> pd.DataFrame:
"""Aggregate social media posts to daily features."""
posts = posts.copy()
posts['date'] = posts['timestamp'].dt.date
# Aggregate
daily = posts.groupby('date').agg({
'likes': ['sum', 'mean'],
'comments': ['sum', 'mean'],
'shares': ['sum', 'mean'],
'sentiment': ['mean', 'std'],
'timestamp': 'count'
})
# Flatten column names
daily.columns = ['_'.join(col).strip() for col in daily.columns.values]
daily = daily.rename(columns={'timestamp_count': 'post_count'})
# Calculate engagement
daily['total_engagement'] = daily['likes_sum'] + daily['comments_sum'] * 2 + daily['shares_sum'] * 3
# Rolling features
daily['engagement_ma5'] = daily['total_engagement'].rolling(5).mean()
daily['sentiment_ma5'] = daily['sentiment_mean'].rolling(5).mean()
daily['post_velocity'] = daily['post_count'].diff(1)
daily.index = pd.to_datetime(daily.index)
return daily.dropna()
# Aggregate
daily_social = aggregate_social_to_daily(social_posts)
print("Daily Social Features:")
print(daily_social.head())
# Exercise 10.3: Social Momentum Detector (Guided)
class SocialMomentumDetector:
"""
Detect momentum in social media activity.
"""
def __init__(self, short_window: int = 3, long_window: int = 10):
self.short_window = short_window
self.long_window = long_window
def calculate_momentum(self, daily_data: pd.DataFrame) -> pd.DataFrame:
"""Calculate social momentum indicators."""
df = daily_data.copy()
# Volume momentum (post count)
# TODO: Calculate short and long moving averages
df['volume_ma_short'] = df['post_count'].______(self.short_window).______()
df['volume_ma_long'] = df['post_count'].______(self.long_window).______()
df['volume_momentum'] = df['volume_ma_short'] / df['volume_ma_long'] - 1
# Engagement momentum
df['eng_ma_short'] = df['total_engagement'].rolling(self.short_window).mean()
df['eng_ma_long'] = df['total_engagement'].rolling(self.long_window).mean()
df['engagement_momentum'] = df['eng_ma_short'] / df['eng_ma_long'] - 1
# Sentiment momentum
df['sent_ma_short'] = df['sentiment_mean'].rolling(self.short_window).mean()
df['sent_ma_long'] = df['sentiment_mean'].rolling(self.long_window).mean()
df['sentiment_momentum'] = df['sent_ma_short'] - df['sent_ma_long']
return df
def generate_signals(self, momentum_data: pd.DataFrame,
volume_threshold: float = 0.2,
sentiment_threshold: float = 0.1) -> pd.Series:
"""Generate trading signals from momentum."""
signals = pd.Series(0, index=momentum_data.index)
# Bullish: high volume momentum + positive sentiment momentum
bullish = (
(momentum_data['volume_momentum'] > volume_threshold) &
(momentum_data['sentiment_momentum'] > sentiment_threshold)
)
signals[bullish] = 1
# Bearish: high volume momentum + negative sentiment momentum
bearish = (
(momentum_data['volume_momentum'] > volume_threshold) &
(momentum_data['sentiment_momentum'] < -sentiment_threshold)
)
signals[bearish] = -1
return signals
# Test
# detector = SocialMomentumDetector()
# momentum_data = detector.calculate_momentum(daily_social)
Solution 10.3
class SocialMomentumDetector:
def __init__(self, short_window: int = 3, long_window: int = 10):
self.short_window = short_window
self.long_window = long_window
def calculate_momentum(self, daily_data: pd.DataFrame) -> pd.DataFrame:
df = daily_data.copy()
df['volume_ma_short'] = df['post_count'].rolling(self.short_window).mean()
df['volume_ma_long'] = df['post_count'].rolling(self.long_window).mean()
df['volume_momentum'] = df['volume_ma_short'] / df['volume_ma_long'] - 1
df['eng_ma_short'] = df['total_engagement'].rolling(self.short_window).mean()
df['eng_ma_long'] = df['total_engagement'].rolling(self.long_window).mean()
df['engagement_momentum'] = df['eng_ma_short'] / df['eng_ma_long'] - 1
df['sent_ma_short'] = df['sentiment_mean'].rolling(self.short_window).mean()
df['sent_ma_long'] = df['sentiment_mean'].rolling(self.long_window).mean()
df['sentiment_momentum'] = df['sent_ma_short'] - df['sent_ma_long']
return df
def generate_signals(self, momentum_data: pd.DataFrame,
volume_threshold: float = 0.2,
sentiment_threshold: float = 0.1) -> pd.Series:
signals = pd.Series(0, index=momentum_data.index)
bullish = (
(momentum_data['volume_momentum'] > volume_threshold) &
(momentum_data['sentiment_momentum'] > sentiment_threshold)
)
signals[bullish] = 1
bearish = (
(momentum_data['volume_momentum'] > volume_threshold) &
(momentum_data['sentiment_momentum'] < -sentiment_threshold)
)
signals[bearish] = -1
return signals
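On a toy series the ratio-minus-one momentum used in `calculate_momentum` behaves as expected: when recent activity runs above its longer baseline, the value is positive. The numbers below are invented to make the effect obvious.

```python
import pandas as pd

# Flat posting activity that suddenly accelerates over the last three days
post_count = pd.Series([10, 10, 10, 10, 10, 10, 10, 20, 30, 40], dtype=float)

short_ma = post_count.rolling(3).mean()    # recent average
long_ma = post_count.rolling(10).mean()    # baseline average
momentum = short_ma / long_ma - 1          # > 0 when activity is accelerating

latest = momentum.iloc[-1]  # short MA = 30, long MA = 16 -> 0.875
```

A `volume_threshold` of 0.2, as in `generate_signals`, would treat this day as a high-attention event.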
Section 4: Multi-Source Prediction Model
Building prediction models that combine multiple data sources.
# Multi-Source Feature Engineering
class MultiSourceFeatureEngine:
"""Create features from multiple alternative data sources."""
def __init__(self):
self.feature_names = []
def create_price_features(self, price_df: pd.DataFrame) -> pd.DataFrame:
"""Create features from price data."""
features = pd.DataFrame(index=price_df.index)
features['returns'] = price_df['Close'].pct_change()
features['volatility'] = features['returns'].rolling(20).std()
for p in [5, 10, 20]:
features[f'momentum_{p}'] = price_df['Close'].pct_change(p)
features['volume_ratio'] = price_df['Volume'] / price_df['Volume'].rolling(20).mean()
return features
def create_social_features(self, social_df: pd.DataFrame) -> pd.DataFrame:
"""Create features from social media data."""
features = pd.DataFrame(index=social_df.index)
features['social_volume'] = social_df.get('post_count', 0)
features['social_sentiment'] = social_df.get('sentiment_mean', 0)
features['social_engagement'] = social_df.get('total_engagement', 0)
# Normalize
for col in features.columns:
mean = features[col].rolling(20).mean()
std = features[col].rolling(20).std()
features[f'{col}_zscore'] = (features[col] - mean) / std
return features
def create_web_features(self, web_df: pd.DataFrame) -> pd.DataFrame:
"""Create features from web traffic data."""
features = pd.DataFrame(index=web_df.index)
for col in ['website_visits', 'google_trend_score', 'app_downloads']:
if col in web_df.columns:
features[col] = web_df[col]
features[f'{col}_change'] = web_df[col].pct_change(5)
return features
def combine_all(self, price_df: pd.DataFrame,
social_df: pd.DataFrame = None,
web_df: pd.DataFrame = None,
alt_data: Dict = None) -> pd.DataFrame:
"""Combine all feature sources."""
# Start with price features
combined = self.create_price_features(price_df)
# Add social features
if social_df is not None:
social_features = self.create_social_features(social_df)
combined = combined.join(social_features, how='left')
# Add web features
if web_df is not None:
web_features = self.create_web_features(web_df)
combined = combined.join(web_features, how='left')
# Add other alt data
if alt_data is not None:
for source_name, df in alt_data.items():
df.columns = [f'{source_name}_{col}' for col in df.columns]
combined = combined.join(df, how='left')
# Fill missing
combined = combined.ffill().fillna(0)
self.feature_names = combined.columns.tolist()
return combined.dropna()
# Test
feature_engine = MultiSourceFeatureEngine()
print("MultiSourceFeatureEngine initialized")
# Build multi-source model
def build_multi_source_model(symbol: str = "AAPL") -> Dict:
"""Build and evaluate model with multiple data sources."""
# Get price data
ticker = yf.Ticker(symbol)
price_df = ticker.history(period="1y")
price_df.index = price_df.index.tz_localize(None)
# Create target
price_df['target'] = (price_df['Close'].pct_change().shift(-1) > 0).astype(int)
# Generate simulated alt data
alt_data = generate_simulated_alt_data(len(price_df), symbol)
# Align alt data with price data
for source in alt_data:
alt_data[source].index = price_df.index[:len(alt_data[source])]
# Create features
feature_engine = MultiSourceFeatureEngine()
# Model 1: Price only
price_features = feature_engine.create_price_features(price_df)
# Model 2: Price + Social
combined_social = feature_engine.combine_all(price_df, social_df=alt_data['social'])
# Model 3: All sources
combined_all = feature_engine.combine_all(
price_df,
social_df=alt_data['social'],
web_df=alt_data['web'],
alt_data={'jobs': alt_data['jobs'], 'geo': alt_data['geo']}
)
results = {}
scaler = StandardScaler()
for name, features in [('price_only', price_features),
('price_social', combined_social),
('all_sources', combined_all)]:
# Align with target
target = price_df['target'].loc[features.index]
features = features.loc[target.index]
features = features.dropna()
target = target.loc[features.index]
# Split
split_idx = int(len(features) * 0.8)
X_train, X_test = features[:split_idx], features[split_idx:]
y_train, y_test = target[:split_idx], target[split_idx:]
# Scale and train
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train_scaled, y_train)
results[name] = {
'accuracy': accuracy_score(y_test, model.predict(X_test_scaled)),
'n_features': len(features.columns),
'feature_importance': dict(zip(features.columns, model.feature_importances_))
}
return results
# Build models
model_results = build_multi_source_model()
print("Multi-Source Model Results:")
print("=" * 50)
for name, result in model_results.items():
print(f"\n{name}:")
print(f" Accuracy: {result['accuracy']:.2%}")
print(f" Features: {result['n_features']}")
print(f" Top 3 Features:")
for feat, imp in sorted(result['feature_importance'].items(), key=lambda x: -x[1])[:3]:
print(f" {feat}: {imp:.4f}")
# Exercise 10.4: Complete Alt Data System (Open-ended)
#
# Build an AlternativeDataSystem class that:
# - Collects data from multiple simulated sources
# - Creates features from each source
# - Combines all sources with price data
# - Trains and evaluates prediction models
# - Compares value-add of each data source
# - Generates a report on feature importance
#
# Your implementation:
Solution 10.4
class AlternativeDataSystem:
"""Complete alternative data trading system."""
def __init__(self, symbol: str = "AAPL"):
self.symbol = symbol
self.price_data = None
self.alt_data = {}
self.features = None
self.models = {}
self.results = {}
def collect_data(self, period: str = "1y"):
"""Collect price and alternative data."""
# Price data
ticker = yf.Ticker(self.symbol)
self.price_data = ticker.history(period=period)
self.price_data.index = self.price_data.index.tz_localize(None)
# Simulated alt data
n_days = len(self.price_data)
self.alt_data = generate_simulated_alt_data(n_days, self.symbol)
# Align indices
for source in self.alt_data:
self.alt_data[source].index = self.price_data.index[:len(self.alt_data[source])]
return self
def create_features(self):
"""Create features from all sources."""
engine = MultiSourceFeatureEngine()
self.features = engine.combine_all(
self.price_data,
social_df=self.alt_data['social'],
web_df=self.alt_data['web'],
alt_data={'jobs': self.alt_data['jobs'], 'geo': self.alt_data['geo']}
)
return self
def train_models(self, test_frac: float = 0.2):
"""Train models with different feature sets."""
target = (self.price_data['Close'].pct_change().shift(-1) > 0).astype(int)
target = target.loc[self.features.index]
# Define feature sets
price_cols = [c for c in self.features.columns if not any(
s in c for s in ['social', 'web', 'jobs', 'geo'])]
social_cols = [c for c in self.features.columns if 'social' in c]
all_cols = self.features.columns.tolist()
feature_sets = {
'price_only': price_cols,
'price_social': price_cols + social_cols,
'all_sources': all_cols
}
split_idx = int(len(self.features) * (1 - test_frac))
scaler = StandardScaler()
for name, cols in feature_sets.items():
X = self.features[cols]
y = target
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train_scaled, y_train)
self.models[name] = model
self.results[name] = {
'accuracy': accuracy_score(y_test, model.predict(X_test_scaled)),
'features': cols,
'importance': dict(zip(cols, model.feature_importances_))
}
return self
def compare_sources(self) -> pd.DataFrame:
"""Compare value of each data source."""
rows = []
base_acc = self.results['price_only']['accuracy']
for name, result in self.results.items():
rows.append({
'Model': name,
'Accuracy': result['accuracy'],
'Improvement': result['accuracy'] - base_acc,
'N_Features': len(result['features'])
})
return pd.DataFrame(rows)
def get_top_features(self, n: int = 10) -> pd.DataFrame:
"""Get top features across all models."""
all_importance = self.results['all_sources']['importance']
return pd.DataFrame([
{'feature': k, 'importance': v}
for k, v in sorted(all_importance.items(), key=lambda x: -x[1])[:n]
])
def generate_report(self) -> str:
"""Generate text report."""
report = f"""Alternative Data System Report
================================
Symbol: {self.symbol}
Data Points: {len(self.features)}
Total Features: {len(self.features.columns)}
Model Comparison:
"""
for name, result in self.results.items():
report += f" {name}: {result['accuracy']:.2%} ({len(result['features'])} features)\n"
report += "\nTop Features:\n"
for _, row in self.get_top_features(5).iterrows():
report += f" {row['feature']}: {row['importance']:.4f}\n"
return report
# Exercise 10.5: Data Source Evaluator (Open-ended)
#
# Build a DataSourceEvaluator class that:
# - Tests predictive value of individual data sources
# - Uses ablation studies (removing one source at a time)
# - Calculates information ratio for each source
# - Ranks sources by value-add
# - Visualizes source contributions
#
# Your implementation:
Solution 10.5
class DataSourceEvaluator:
"""Evaluate predictive value of individual data sources."""
def __init__(self, features: pd.DataFrame, target: pd.Series):
self.features = features
self.target = target
self.source_results = {}
def identify_sources(self) -> Dict[str, List[str]]:
"""Identify which features belong to which source."""
sources = {'price': [], 'social': [], 'web': [], 'jobs': [], 'geo': []}
for col in self.features.columns:
assigned = False
for source in ['social', 'web', 'jobs', 'geo']:
if source in col.lower():
sources[source].append(col)
assigned = True
break
if not assigned:
sources['price'].append(col)
return sources
def evaluate_single_source(self, source_name: str,
source_cols: List[str]) -> Dict:
"""Evaluate a single data source."""
if not source_cols:
return {'accuracy': 0, 'features': 0}
X = self.features[source_cols]
y = self.target
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
model.fit(X_train_scaled, y_train)
return {
'accuracy': accuracy_score(y_test, model.predict(X_test_scaled)),
'features': len(source_cols)
}
def ablation_study(self) -> pd.DataFrame:
"""Remove each source and measure impact."""
sources = self.identify_sources()
all_cols = self.features.columns.tolist()
# Full model baseline
full_result = self.evaluate_single_source('all', all_cols)
results = [{'source': 'all', 'accuracy': full_result['accuracy'],
'drop': 0, 'features': len(all_cols)}]
for source, cols in sources.items():
if cols:
remaining_cols = [c for c in all_cols if c not in cols]
if remaining_cols:
result = self.evaluate_single_source(f'without_{source}', remaining_cols)
results.append({
'source': f'without_{source}',
'accuracy': result['accuracy'],
'drop': full_result['accuracy'] - result['accuracy'],
'features': len(remaining_cols)
})
return pd.DataFrame(results).sort_values('drop', ascending=False)
def rank_sources(self) -> pd.DataFrame:
"""Rank data sources by value."""
sources = self.identify_sources()
results = []
for source, cols in sources.items():
result = self.evaluate_single_source(source, cols)
results.append({
'source': source,
'accuracy': result['accuracy'],
'features': result['features'],
'acc_per_feature': result['accuracy'] / max(result['features'], 1)
})
return pd.DataFrame(results).sort_values('accuracy', ascending=False)
def plot_contributions(self):
"""Visualize source contributions."""
ranking = self.rank_sources()
ablation = self.ablation_study()
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].bar(ranking['source'], ranking['accuracy'])
axes[0].set_title('Accuracy by Single Source')
axes[0].set_ylabel('Accuracy')
axes[0].axhline(y=0.5, color='red', linestyle='--')
axes[0].tick_params(axis='x', rotation=45)
ablation_plot = ablation[ablation['source'] != 'all']
colors = ['red' if d > 0 else 'green' for d in ablation_plot['drop']]
axes[1].bar(ablation_plot['source'], ablation_plot['drop'], color=colors)
axes[1].set_title('Accuracy Drop When Removing Source')
axes[1].set_ylabel('Accuracy Drop')
axes[1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
# Exercise 10.6: Production Alt Data Pipeline (Open-ended)
#
# Build a ProductionAltDataPipeline class that:
# - Simulates real-time data collection
# - Handles missing data and outliers
# - Updates features incrementally
# - Generates trading signals with confidence
# - Tracks data quality metrics
# - Provides alerting for data anomalies
#
# Your implementation:
Solution 10.6
class ProductionAltDataPipeline:
"""Production-ready alternative data pipeline."""
def __init__(self):
self.data_buffer = {}
self.features = pd.DataFrame()
self.model = None
self.scaler = StandardScaler()
self.quality_metrics = {}
self.alerts = []
def ingest_data(self, source: str, data: Dict, timestamp: datetime):
"""Ingest new data point."""
if source not in self.data_buffer:
self.data_buffer[source] = []
data['timestamp'] = timestamp
self.data_buffer[source].append(data)
# Check for anomalies
self._check_anomalies(source, data)
def _check_anomalies(self, source: str, data: Dict):
"""Check for data anomalies."""
if len(self.data_buffer[source]) < 10:
return
# Get recent values
recent = pd.DataFrame(self.data_buffer[source][-20:])
for col in recent.select_dtypes(include=[np.number]).columns:
mean = recent[col].mean()
std = recent[col].std()
if std > 0:
z_score = abs(data.get(col, mean) - mean) / std
if z_score > 3:
self.alerts.append({
'timestamp': data['timestamp'],
'source': source,
'field': col,
'z_score': z_score,
'message': f'Anomaly detected in {source}.{col}'
})
def update_features(self):
"""Update features from buffer."""
if not self.data_buffer:
return
# Convert buffers to DataFrames
dfs = {}
for source, buffer in self.data_buffer.items():
df = pd.DataFrame(buffer)
df = df.set_index('timestamp')
df.columns = [f'{source}_{c}' for c in df.columns]
dfs[source] = df
# Combine
if dfs:
combined = list(dfs.values())[0]
for df in list(dfs.values())[1:]:
combined = combined.join(df, how='outer')
self.features = combined.ffill()
def handle_missing_data(self, method: str = 'ffill'):
"""Handle missing data."""
before = self.features.isna().sum().sum()
if method == 'ffill':
self.features = self.features.ffill()
elif method == 'interpolate':
self.features = self.features.interpolate()
elif method == 'zero':
self.features = self.features.fillna(0)
after = self.features.isna().sum().sum()
self.quality_metrics['missing_filled'] = before - after
def generate_signal(self) -> Dict:
"""Generate trading signal from latest features."""
if self.model is None or self.features.empty:
return {'signal': 0, 'confidence': 0}
latest = self.features.iloc[-1:]
latest_scaled = self.scaler.transform(latest)
prediction = self.model.predict(latest_scaled)[0]
proba = self.model.predict_proba(latest_scaled)[0]
confidence = max(proba)
return {
'signal': prediction,
'signal_name': 'BUY' if prediction == 1 else 'SELL',
'confidence': confidence,
'timestamp': self.features.index[-1]
}
def get_quality_report(self) -> Dict:
"""Generate data quality report."""
return {
'sources': list(self.data_buffer.keys()),
'total_records': sum(len(b) for b in self.data_buffer.values()),
'feature_count': len(self.features.columns),
'missing_values': self.features.isna().sum().to_dict(),
'alerts': len(self.alerts),
'recent_alerts': self.alerts[-5:] if self.alerts else []
}
def train_model(self, target: pd.Series):
"""Train prediction model."""
aligned = self.features.loc[target.index].dropna()
target = target.loc[aligned.index]
X_scaled = self.scaler.fit_transform(aligned)
self.model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
self.model.fit(X_scaled, target)
return self
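The z-score alerting in `_check_anomalies` reduces to a few lines. This standalone sketch flags a single injected spike against a stable history of mention counts; the data is made up for illustration.

```python
import pandas as pd

history = pd.Series([99.0, 101.0] * 10)  # stable mention counts with mild noise
new_value = 500.0                        # sudden spike from the latest scrape

# Standardize the new observation against the historical mean and std
z = abs(new_value - history.mean()) / history.std()
alert = z > 3.0  # same 3-sigma rule as _check_anomalies
```

Spikes like this are exactly what an alt-data pipeline wants to surface: they may reflect genuine news flow, but just as often a broken scraper or a changed page layout.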
Module Project: Complete Alternative Data Trading System
Build a comprehensive system combining multiple alternative data sources.
class AltDataTradingSystem:
"""
Complete alternative data trading system.
Combines social, web, job, and geospatial data
with price data for trading signals.
"""
def __init__(self, symbol: str = "AAPL"):
self.symbol = symbol
self.price_data = None
self.alt_data = {}
self.features = None
self.model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
self.scaler = StandardScaler()
def load_data(self, period: str = "1y"):
"""Load price and alternative data."""
# Price data
ticker = yf.Ticker(self.symbol)
self.price_data = ticker.history(period=period)
self.price_data.index = self.price_data.index.tz_localize(None)
# Target
self.price_data['target'] = (self.price_data['Close'].pct_change().shift(-1) > 0).astype(int)
# Simulated alt data
self.alt_data = generate_simulated_alt_data(len(self.price_data), self.symbol)
# Align indices
for source in self.alt_data:
self.alt_data[source].index = self.price_data.index[:len(self.alt_data[source])]
print(f"Loaded {len(self.price_data)} days of price data")
print(f"Alt data sources: {list(self.alt_data.keys())}")
return self
def create_features(self):
"""Create comprehensive feature set."""
features = pd.DataFrame(index=self.price_data.index)
# Price features
features['returns'] = self.price_data['Close'].pct_change()
features['volatility'] = features['returns'].rolling(20).std()
for p in [5, 10, 20]:
features[f'momentum_{p}'] = self.price_data['Close'].pct_change(p)
features['volume_ratio'] = self.price_data['Volume'] / self.price_data['Volume'].rolling(20).mean()
# Social features
social = self.alt_data['social']
features['social_sentiment'] = social['twitter_sentiment']
features['social_volume'] = social['twitter_mentions']
features['bullish_ratio'] = social['stocktwits_bullish_pct']
features['social_sentiment_ma5'] = features['social_sentiment'].rolling(5).mean()
features['social_momentum'] = features['social_sentiment'] - features['social_sentiment_ma5']
# Web features
web = self.alt_data['web']
features['web_traffic'] = web['website_visits']
features['traffic_change'] = web['website_visits'].pct_change(5)
features['google_trends'] = web['google_trend_score']
# Job features
jobs = self.alt_data['jobs']
features['job_growth'] = jobs['job_postings'].pct_change(20)
features['eng_ratio'] = jobs['engineering_jobs'] / (jobs['sales_jobs'] + 1)
# Geo features
geo = self.alt_data['geo']
features['foot_traffic'] = geo['store_foot_traffic']
features['parking_fill'] = geo['parking_lot_fill']
self.features = features.dropna()
self.feature_names = self.features.columns.tolist()
print(f"Created {len(self.feature_names)} features")
return self
def fit(self, test_frac: float = 0.2):
"""Train the model."""
target = self.price_data['target'].loc[self.features.index]
split_idx = int(len(self.features) * (1 - test_frac))
self.X_train = self.features[:split_idx]
self.X_test = self.features[split_idx:]
self.y_train = target[:split_idx]
self.y_test = target[split_idx:]
self.test_returns = self.price_data['Close'].pct_change()[split_idx:]
X_train_scaled = self.scaler.fit_transform(self.X_train)
self.model.fit(X_train_scaled, self.y_train)
return self
def evaluate(self) -> Dict:
"""Evaluate model performance."""
X_test_scaled = self.scaler.transform(self.X_test)
y_pred = self.model.predict(X_test_scaled)
# Classification metrics
accuracy = accuracy_score(self.y_test, y_pred)
# Financial metrics
pred_series = pd.Series(y_pred, index=self.y_test.index)
test_returns = self.test_returns.loc[pred_series.index]
strategy_returns = pred_series.shift(1) * test_returns
strategy_returns = strategy_returns.dropna()
total_return = (1 + strategy_returns).cumprod().iloc[-1] - 1
bh_return = (1 + test_returns.loc[strategy_returns.index]).cumprod().iloc[-1] - 1
sharpe = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std()
return {
'accuracy': accuracy,
'total_return': total_return,
'buy_hold_return': bh_return,
'outperformance': total_return - bh_return,
'sharpe_ratio': sharpe,
'feature_importance': dict(zip(self.feature_names, self.model.feature_importances_))
}
def plot_results(self):
"""Visualize results."""
X_test_scaled = self.scaler.transform(self.X_test)
y_pred = self.model.predict(X_test_scaled)
pred_series = pd.Series(y_pred, index=self.y_test.index)
test_returns = self.test_returns.loc[pred_series.index]
strategy_returns = pred_series.shift(1) * test_returns
cum_strategy = (1 + strategy_returns.fillna(0)).cumprod()
cum_bh = (1 + test_returns.fillna(0)).cumprod()
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Cumulative returns
axes[0, 0].plot(cum_strategy.index, cum_strategy, label='Strategy', linewidth=2)
axes[0, 0].plot(cum_bh.index, cum_bh, label='Buy & Hold', linewidth=2, alpha=0.7)
axes[0, 0].set_ylabel('Cumulative Return')
axes[0, 0].set_title('Alt Data Strategy Performance')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Feature importance
importance = pd.Series(
self.model.feature_importances_,
index=self.feature_names
).sort_values(ascending=True).tail(10)
axes[0, 1].barh(importance.index, importance.values, color='steelblue')
axes[0, 1].set_xlabel('Importance')
axes[0, 1].set_title('Top 10 Feature Importance')
axes[0, 1].grid(True, alpha=0.3)
# Social sentiment vs returns
axes[1, 0].scatter(self.features['social_sentiment'].loc[test_returns.index],
test_returns, alpha=0.5)
axes[1, 0].set_xlabel('Social Sentiment')
axes[1, 0].set_ylabel('Next Day Return')
axes[1, 0].set_title('Sentiment vs Returns')
axes[1, 0].axhline(y=0, color='red', linestyle='--')
axes[1, 0].axvline(x=0, color='red', linestyle='--')
axes[1, 0].grid(True, alpha=0.3)
# Data source contribution
sources = {'price': 0, 'social': 0, 'web': 0, 'job': 0, 'geo': 0}
for feat, imp in zip(self.feature_names, self.model.feature_importances_):
    # Check geo keywords first: 'foot_traffic' also contains 'traffic',
    # so testing the web branch first would misattribute it
    if 'foot' in feat or 'parking' in feat:
        sources['geo'] += imp
    elif 'social' in feat or 'bullish' in feat:
        sources['social'] += imp
    elif 'web' in feat or 'traffic' in feat or 'google' in feat:
        sources['web'] += imp
    elif 'job' in feat or 'eng' in feat:
        sources['job'] += imp
    else:
        sources['price'] += imp
axes[1, 1].pie(sources.values(), labels=sources.keys(), autopct='%1.1f%%')
axes[1, 1].set_title('Feature Importance by Source')
plt.tight_layout()
plt.show()
# Run the complete system
system = AltDataTradingSystem("AAPL")
system.load_data()
system.create_features()
system.fit()
# Evaluate
results = system.evaluate()
print("\n" + "="*50)
print("ALT DATA TRADING SYSTEM RESULTS")
print("="*50)
print(f"\nAccuracy: {results['accuracy']:.2%}")
print(f"\nStrategy Return: {results['total_return']:.2%}")
print(f"Buy & Hold Return: {results['buy_hold_return']:.2%}")
print(f"Outperformance: {results['outperformance']:.2%}")
print(f"Sharpe Ratio: {results['sharpe_ratio']:.2f}")
print(f"\nTop 5 Features:")
for feat, imp in sorted(results['feature_importance'].items(), key=lambda x: -x[1])[:5]:
print(f" {feat}: {imp:.4f}")
# Visualize
system.plot_results()
Key Takeaways
- Alternative data provides unique signals beyond price and volume
- Data collection requires attention to rate limits, caching, and error handling
- Social media data can capture market sentiment in real time
- Multi-source models often outperform single-source models
- Data quality monitoring is essential for production systems
- Ablation studies help identify which sources provide the most value
- Alpha decay means alternative data edges diminish as more traders use them
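The ablation takeaway can be made concrete with a small sketch: drop one source's feature columns, retrain, and measure the accuracy change. This is a toy illustration on synthetic data (the column names, source groupings, and RandomForest settings are all hypothetical), not a re-run of the AltDataTradingSystem above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
# Toy features, each standing in for a (hypothetical) data source
X = pd.DataFrame({
    'returns_5d': rng.normal(size=n),        # price source
    'social_sentiment': rng.normal(size=n),  # social source
    'web_traffic': rng.normal(size=n),       # web source
})
# In this toy setup the target depends mostly on social sentiment
y = (X['social_sentiment'] + 0.2 * rng.normal(size=n) > 0).astype(int)

sources = {'price': ['returns_5d'], 'social': ['social_sentiment'], 'web': ['web_traffic']}
split = int(n * 0.8)

def fit_score(cols):
    """Train on a feature subset and return out-of-sample accuracy."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[cols][:split], y[:split])
    return accuracy_score(y[split:], model.predict(X[cols][split:]))

baseline = fit_score(X.columns.tolist())
for name, cols in sources.items():
    kept = [c for c in X.columns if c not in cols]
    print(f"dropping {name:>6}: {baseline - fit_score(kept):+.3f} accuracy change")
```

Dropping the source that carries the signal should produce the largest accuracy loss; sources whose removal barely moves the score are candidates for deprecation.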
Next: Module 11 - Deep Learning for Finance (Neural networks, LSTM, transformers)
Module 11: Deep Learning for Finance
Overview
Deep learning has revolutionized many fields, and finance is no exception. This module covers neural network architectures particularly suited for financial applications, from basic feedforward networks to advanced sequence models like LSTMs and Transformers.
Learning Objectives
By the end of this module, you will be able to:
- Build and train neural networks for financial prediction
- Implement LSTM networks for time series forecasting
- Apply attention mechanisms and Transformers to market data
- Design appropriate architectures for different financial tasks
Prerequisites
- Module 6: Other Classification Models (Neural Network basics)
- Module 8: Regression Models
- Understanding of backpropagation and gradient descent
Estimated Time: 4 hours
Section 1: Neural Network Fundamentals for Finance
Neural networks can capture complex non-linear relationships in financial data that traditional models miss.
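To see why this matters, here is a minimal, hypothetical comparison (toy data, not the course dataset): a logistic regression cannot learn an interaction target like sign(x1·x2), while even a small MLP can.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 2))           # two toy "factor" features
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-like interaction target

X_train, X_test = X[:1600], X[1600:]
y_train, y_test = y[:1600], y[1600:]

linear = LogisticRegression().fit(X_train, y_train)
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=1000,
                    random_state=42).fit(X_train, y_train)

print(f"Logistic regression accuracy: {linear.score(X_test, y_test):.2f}")
print(f"Small MLP accuracy:           {mlp.score(X_test, y_test):.2f}")
```

The linear model hovers near chance because no single hyperplane separates the four quadrants; the MLP's hidden layers compose one.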
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')
# Deep learning imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Model, Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, mean_absolute_error
np.random.seed(42)
tf.random.set_seed(42)
print(f"TensorFlow version: {tf.__version__}")
print("Deep learning libraries loaded successfully")
# Generate synthetic financial data
def generate_financial_data(n_samples=2000):
"""Generate synthetic stock data with realistic patterns."""
np.random.seed(42)
dates = pd.date_range(start='2018-01-01', periods=n_samples, freq='D')
# Generate price series with trend and volatility clustering
returns = np.random.normal(0.0003, 0.015, n_samples)
# Add volatility clustering (GARCH-like effect)
volatility = np.ones(n_samples) * 0.015
for i in range(1, n_samples):
volatility[i] = 0.9 * volatility[i-1] + 0.1 * abs(returns[i-1]) * 2
returns[i] = np.random.normal(0.0003, volatility[i])
# Generate prices
prices = 100 * np.exp(np.cumsum(returns))
# Create OHLCV data
high = prices * (1 + np.abs(np.random.normal(0, 0.01, n_samples)))
low = prices * (1 - np.abs(np.random.normal(0, 0.01, n_samples)))
volume = np.random.lognormal(15, 0.5, n_samples)
df = pd.DataFrame({
'date': dates,
'open': np.roll(prices, 1),
'high': high,
'low': low,
'close': prices,
'volume': volume
})
df.loc[0, 'open'] = df.loc[0, 'close']
df.set_index('date', inplace=True)
return df
# Generate data
df = generate_financial_data(2000)
print(f"Dataset shape: {df.shape}")
df.tail()
# Feature engineering for neural networks
def create_nn_features(df, lookback_periods=[5, 10, 20, 50]):
"""Create features suitable for neural network input."""
data = df.copy()
# Returns at different horizons
data['return_1d'] = data['close'].pct_change()
data['return_5d'] = data['close'].pct_change(5)
data['return_20d'] = data['close'].pct_change(20)
# Volatility features
for period in lookback_periods:
data[f'volatility_{period}d'] = data['return_1d'].rolling(period).std()
data[f'sma_{period}'] = data['close'].rolling(period).mean()
data[f'price_to_sma_{period}'] = data['close'] / data[f'sma_{period}']
# RSI
delta = data['close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
rs = gain / loss
data['rsi'] = 100 - (100 / (1 + rs))
# MACD
exp12 = data['close'].ewm(span=12).mean()
exp26 = data['close'].ewm(span=26).mean()
data['macd'] = exp12 - exp26
data['macd_signal'] = data['macd'].ewm(span=9).mean()
data['macd_hist'] = data['macd'] - data['macd_signal']
# Volume features
data['volume_sma_20'] = data['volume'].rolling(20).mean()
data['volume_ratio'] = data['volume'] / data['volume_sma_20']
# Price range
data['daily_range'] = (data['high'] - data['low']) / data['close']
data['avg_range_20'] = data['daily_range'].rolling(20).mean()
# Target: Next day return direction
data['target'] = (data['close'].shift(-1) > data['close']).astype(int)
data['target_return'] = data['close'].pct_change().shift(-1)
return data.dropna()
# Create features
df_features = create_nn_features(df)
print(f"Features created: {df_features.shape[1]} columns")
print(f"Samples after processing: {len(df_features)}")
# Building a Feedforward Neural Network for classification
class FinancialNeuralNetwork:
"""Neural network for financial prediction tasks."""
def __init__(self, input_dim, hidden_layers=[64, 32, 16],
dropout_rate=0.3, learning_rate=0.001):
self.input_dim = input_dim
self.hidden_layers = hidden_layers
self.dropout_rate = dropout_rate
self.learning_rate = learning_rate
self.model = None
self.scaler = StandardScaler()
self.history = None
def build_classifier(self):
"""Build classification neural network."""
model = Sequential()
# Input layer
model.add(layers.Dense(self.hidden_layers[0],
input_dim=self.input_dim,
activation='relu',
kernel_regularizer=keras.regularizers.l2(0.01)))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(self.dropout_rate))
# Hidden layers
for units in self.hidden_layers[1:]:
model.add(layers.Dense(units, activation='relu',
kernel_regularizer=keras.regularizers.l2(0.01)))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(self.dropout_rate))
# Output layer
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(
optimizer=Adam(learning_rate=self.learning_rate),
loss='binary_crossentropy',
metrics=['accuracy']
)
self.model = model
return model
def build_regressor(self):
"""Build regression neural network."""
model = Sequential()
# Input layer
model.add(layers.Dense(self.hidden_layers[0],
input_dim=self.input_dim,
activation='relu',
kernel_regularizer=keras.regularizers.l2(0.01)))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(self.dropout_rate))
# Hidden layers
for units in self.hidden_layers[1:]:
model.add(layers.Dense(units, activation='relu',
kernel_regularizer=keras.regularizers.l2(0.01)))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(self.dropout_rate))
# Output layer (linear for regression)
model.add(layers.Dense(1, activation='linear'))
model.compile(
optimizer=Adam(learning_rate=self.learning_rate),
loss='mse',
metrics=['mae']
)
self.model = model
return model
def prepare_data(self, X, fit_scaler=True):
"""Scale features for neural network."""
if fit_scaler:
return self.scaler.fit_transform(X)
return self.scaler.transform(X)
def train(self, X_train, y_train, X_val=None, y_val=None,
epochs=100, batch_size=32, verbose=1):
"""Train the neural network with early stopping."""
callbacks = [
EarlyStopping(monitor='val_loss', patience=10,
restore_best_weights=True),
ReduceLROnPlateau(monitor='val_loss', factor=0.5,
patience=5, min_lr=1e-6)
]
validation_data = None
if X_val is not None and y_val is not None:
validation_data = (X_val, y_val)
self.history = self.model.fit(
X_train, y_train,
validation_data=validation_data,
epochs=epochs,
batch_size=batch_size,
callbacks=callbacks,
verbose=verbose
)
return self.history
def predict(self, X):
"""Make predictions."""
return self.model.predict(X, verbose=0)
def predict_classes(self, X, threshold=0.5):
"""Predict class labels."""
probs = self.predict(X)
return (probs >= threshold).astype(int).flatten()
print("FinancialNeuralNetwork class defined")
# Prepare data for classification
feature_cols = ['return_1d', 'return_5d', 'return_20d',
'volatility_5d', 'volatility_10d', 'volatility_20d',
'price_to_sma_5', 'price_to_sma_10', 'price_to_sma_20', 'price_to_sma_50',
'rsi', 'macd_hist', 'volume_ratio', 'daily_range']
X = df_features[feature_cols].values
y = df_features['target'].values
# Time-based split (no shuffling for time series)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
# Further split training for validation
val_idx = int(len(X_train) * 0.8)
X_train_nn, X_val = X_train[:val_idx], X_train[val_idx:]
y_train_nn, y_val = y_train[:val_idx], y_train[val_idx:]
# Build and train classifier
nn_classifier = FinancialNeuralNetwork(
input_dim=len(feature_cols),
hidden_layers=[64, 32, 16],
dropout_rate=0.3
)
nn_classifier.build_classifier()
# Scale data
X_train_scaled = nn_classifier.prepare_data(X_train_nn, fit_scaler=True)
X_val_scaled = nn_classifier.prepare_data(X_val, fit_scaler=False)
X_test_scaled = nn_classifier.prepare_data(X_test, fit_scaler=False)
# Train model
print("Training neural network classifier...")
history = nn_classifier.train(
X_train_scaled, y_train_nn,
X_val_scaled, y_val,
epochs=50,
batch_size=32,
verbose=0
)
# Evaluate
train_pred = nn_classifier.predict_classes(X_train_scaled)
test_pred = nn_classifier.predict_classes(X_test_scaled)
print(f"\nTraining accuracy: {accuracy_score(y_train_nn, train_pred):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, test_pred):.4f}")
# Visualize training history
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Loss
axes[0].plot(history.history['loss'], label='Training')
axes[0].plot(history.history['val_loss'], label='Validation')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Model Loss')
axes[0].legend()
# Accuracy
axes[1].plot(history.history['accuracy'], label='Training')
axes[1].plot(history.history['val_accuracy'], label='Validation')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Model Accuracy')
axes[1].legend()
plt.tight_layout()
plt.show()
Section 2: LSTM Networks for Time Series
Long Short-Term Memory (LSTM) networks are designed to capture long-term dependencies in sequential data, making them ideal for financial time series.
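Under the hood, a single LSTM step combines forget, input, and output gates with a candidate cell state. A bare NumPy sketch of one time step (random, untrained weights, purely to show the gating arithmetic):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4*units, units + n_features)."""
    units = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0 * units:1 * units])   # forget gate: what to keep from c_prev
    i = sigmoid(z[1 * units:2 * units])   # input gate: how much new info to write
    o = sigmoid(z[2 * units:3 * units])   # output gate: how much state to expose
    g = np.tanh(z[3 * units:4 * units])   # candidate cell state
    c = f * c_prev + i * g                # new cell state (the "long-term memory")
    h = o * np.tanh(c)                    # new hidden state
    return h, c

rng = np.random.default_rng(0)
units, n_features = 8, 5
W = rng.normal(scale=0.1, size=(4 * units, units + n_features))
b = np.zeros(4 * units)
h, c = np.zeros(units), np.zeros(units)
for t in range(20):                       # run a 20-step sequence
    h, c = lstm_step(rng.normal(size=n_features), h, c, W, b)
print(h.shape)
```

The additive update `c = f * c_prev + i * g` is what lets gradients flow across many time steps, which is the property the prose above refers to.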
# LSTM for Financial Time Series
class FinancialLSTM:
"""LSTM network for financial time series prediction."""
def __init__(self, sequence_length=20, n_features=1,
lstm_units=[64, 32], dropout_rate=0.2):
self.sequence_length = sequence_length
self.n_features = n_features
self.lstm_units = lstm_units
self.dropout_rate = dropout_rate
self.model = None
self.scaler = MinMaxScaler()
def create_sequences(self, data, target_col_idx=-1):
"""Create sequences for LSTM input."""
X, y = [], []
for i in range(self.sequence_length, len(data)):
X.append(data[i-self.sequence_length:i])
y.append(data[i, target_col_idx])
return np.array(X), np.array(y)
def build_model(self, output_type='regression'):
"""Build LSTM model."""
model = Sequential()
# First LSTM layer
model.add(layers.LSTM(
self.lstm_units[0],
return_sequences=len(self.lstm_units) > 1,
input_shape=(self.sequence_length, self.n_features)
))
model.add(layers.Dropout(self.dropout_rate))
# Additional LSTM layers
for i, units in enumerate(self.lstm_units[1:]):
return_seq = i < len(self.lstm_units) - 2
model.add(layers.LSTM(units, return_sequences=return_seq))
model.add(layers.Dropout(self.dropout_rate))
# Dense layers
model.add(layers.Dense(16, activation='relu'))
# Output layer
if output_type == 'regression':
model.add(layers.Dense(1, activation='linear'))
model.compile(optimizer=Adam(learning_rate=0.001),
loss='mse', metrics=['mae'])
else:
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer=Adam(learning_rate=0.001),
loss='binary_crossentropy', metrics=['accuracy'])
self.model = model
return model
def prepare_data(self, df, feature_cols, target_col):
"""Prepare data for LSTM training."""
# Get features and target
features = df[feature_cols].values
# Scale features
scaled_features = self.scaler.fit_transform(features)
# Create sequences
X, y = self.create_sequences(scaled_features,
target_col_idx=feature_cols.index(target_col))
return X, y
def train(self, X_train, y_train, X_val=None, y_val=None,
epochs=50, batch_size=32, verbose=1):
"""Train LSTM model."""
callbacks = [
EarlyStopping(monitor='val_loss', patience=10,
restore_best_weights=True),
ReduceLROnPlateau(monitor='val_loss', factor=0.5,
patience=5, min_lr=1e-6)
]
validation_data = None
if X_val is not None:
validation_data = (X_val, y_val)
history = self.model.fit(
X_train, y_train,
validation_data=validation_data,
epochs=epochs,
batch_size=batch_size,
callbacks=callbacks,
verbose=verbose
)
return history
def predict(self, X):
"""Make predictions."""
return self.model.predict(X, verbose=0)
print("FinancialLSTM class defined")
# Prepare data for LSTM
lstm_features = ['return_1d', 'volatility_10d', 'rsi', 'macd_hist', 'volume_ratio']
# Add return as target (shifted for prediction)
df_lstm = df_features[lstm_features].copy()
df_lstm['target_return'] = df_features['return_1d'].shift(-1)
df_lstm = df_lstm.dropna()
# Scale all features
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df_lstm.values)
# Create sequences
sequence_length = 20
X_seq, y_seq = [], []
for i in range(sequence_length, len(scaled_data)):
X_seq.append(scaled_data[i-sequence_length:i, :-1]) # All features except target
y_seq.append(scaled_data[i, -1]) # Target return
X_seq = np.array(X_seq)
y_seq = np.array(y_seq)
print(f"Sequence shape: {X_seq.shape}")
print(f"Target shape: {y_seq.shape}")
# Train LSTM model
# Time-based split
split_idx = int(len(X_seq) * 0.8)
X_train_lstm = X_seq[:split_idx]
X_test_lstm = X_seq[split_idx:]
y_train_lstm = y_seq[:split_idx]
y_test_lstm = y_seq[split_idx:]
# Validation split
val_idx = int(len(X_train_lstm) * 0.8)
X_train_l, X_val_l = X_train_lstm[:val_idx], X_train_lstm[val_idx:]
y_train_l, y_val_l = y_train_lstm[:val_idx], y_train_lstm[val_idx:]
# Build LSTM
lstm_model = Sequential([
layers.LSTM(64, return_sequences=True,
input_shape=(sequence_length, len(lstm_features))),
layers.Dropout(0.2),
layers.LSTM(32, return_sequences=False),
layers.Dropout(0.2),
layers.Dense(16, activation='relu'),
layers.Dense(1, activation='linear')
])
lstm_model.compile(optimizer=Adam(learning_rate=0.001),
loss='mse', metrics=['mae'])
print("Training LSTM model...")
lstm_history = lstm_model.fit(
X_train_l, y_train_l,
validation_data=(X_val_l, y_val_l),
epochs=50,
batch_size=32,
callbacks=[EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)],
verbose=0
)
# Evaluate
train_pred_lstm = lstm_model.predict(X_train_l, verbose=0)
test_pred_lstm = lstm_model.predict(X_test_lstm, verbose=0)
print(f"\nTraining MSE: {mean_squared_error(y_train_l, train_pred_lstm):.6f}")
print(f"Test MSE: {mean_squared_error(y_test_lstm, test_pred_lstm):.6f}")
print(f"Test MAE: {mean_absolute_error(y_test_lstm, test_pred_lstm):.6f}")
# Visualize LSTM predictions
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
# Training history
axes[0].plot(lstm_history.history['loss'], label='Training Loss')
axes[0].plot(lstm_history.history['val_loss'], label='Validation Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss (MSE)')
axes[0].set_title('LSTM Training History')
axes[0].legend()
# Predictions vs Actual
test_range = range(len(test_pred_lstm))
axes[1].plot(test_range, y_test_lstm, label='Actual', alpha=0.7)
axes[1].plot(test_range, test_pred_lstm, label='Predicted', alpha=0.7)
axes[1].set_xlabel('Time Step')
axes[1].set_ylabel('Scaled Return')
axes[1].set_title('LSTM Predictions vs Actual (Test Set)')
axes[1].legend()
plt.tight_layout()
plt.show()
Section 3: Attention Mechanisms and Transformers
Transformers use attention mechanisms to capture relationships across all time steps simultaneously, and in some settings they outperform LSTMs on financial data.
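The core operation is scaled dot-product attention. A NumPy sketch (random toy matrices, separate from the Keras layers built below):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 20, 32                            # e.g. a 20-day window of embeddings
x = rng.normal(size=(seq_len, d_model))
out, w = scaled_dot_product_attention(x, x, x)       # self-attention: Q = K = V = x
print(out.shape)                                     # (20, 32)
```

Each output position is a weighted average over every time step in the window, so the weight row for the last day shows directly which earlier days the model attends to.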
# Custom Attention Layer
class AttentionLayer(layers.Layer):
"""Simple attention mechanism for time series."""
def __init__(self, **kwargs):
super(AttentionLayer, self).__init__(**kwargs)
def build(self, input_shape):
self.W = self.add_weight(
name='attention_weight',
shape=(input_shape[-1], 1),
initializer='glorot_uniform',
trainable=True
)
self.b = self.add_weight(
name='attention_bias',
shape=(input_shape[1], 1),
initializer='zeros',
trainable=True
)
super(AttentionLayer, self).build(input_shape)
def call(self, x):
# Compute attention scores
e = tf.nn.tanh(tf.tensordot(x, self.W, axes=1) + self.b)
a = tf.nn.softmax(e, axis=1)
# Apply attention weights
output = tf.reduce_sum(x * a, axis=1)
return output
def compute_output_shape(self, input_shape):
return (input_shape[0], input_shape[-1])
print("AttentionLayer defined")
# Transformer Block for Financial Data
class TransformerBlock(layers.Layer):
"""Transformer block with multi-head attention."""
def __init__(self, embed_dim, num_heads, ff_dim, dropout_rate=0.1):
super(TransformerBlock, self).__init__()
self.embed_dim = embed_dim
self.num_heads = num_heads
self.ff_dim = ff_dim
self.att = layers.MultiHeadAttention(
num_heads=num_heads,
key_dim=embed_dim
)
self.ffn = Sequential([
layers.Dense(ff_dim, activation='relu'),
layers.Dense(embed_dim)
])
self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
self.dropout1 = layers.Dropout(dropout_rate)
self.dropout2 = layers.Dropout(dropout_rate)
def call(self, inputs, training=False):
# Multi-head self-attention
attn_output = self.att(inputs, inputs)
attn_output = self.dropout1(attn_output, training=training)
out1 = self.layernorm1(inputs + attn_output)
# Feed-forward network
ffn_output = self.ffn(out1)
ffn_output = self.dropout2(ffn_output, training=training)
return self.layernorm2(out1 + ffn_output)
print("TransformerBlock defined")
# Positional Encoding for Transformers
class PositionalEncoding(layers.Layer):
"""Add positional information to embeddings."""
def __init__(self, sequence_length, embed_dim):
super(PositionalEncoding, self).__init__()
self.sequence_length = sequence_length
self.embed_dim = embed_dim
def build(self, input_shape):
# Create positional encoding matrix
position = np.arange(self.sequence_length)[:, np.newaxis]
div_term = np.exp(np.arange(0, self.embed_dim, 2) *
-(np.log(10000.0) / self.embed_dim))
pe = np.zeros((self.sequence_length, self.embed_dim))
pe[:, 0::2] = np.sin(position * div_term)
if self.embed_dim > 1:
pe[:, 1::2] = np.cos(position * div_term[:self.embed_dim//2])
self.pe = tf.constant(pe, dtype=tf.float32)
super(PositionalEncoding, self).build(input_shape)
def call(self, x):
return x + self.pe
print("PositionalEncoding defined")
# Build Financial Transformer Model
def build_financial_transformer(sequence_length, n_features,
embed_dim=32, num_heads=4,
ff_dim=64, num_blocks=2,
output_type='regression'):
"""Build a Transformer model for financial prediction."""
inputs = layers.Input(shape=(sequence_length, n_features))
# Project input to embedding dimension
x = layers.Dense(embed_dim)(inputs)
# Add positional encoding
x = PositionalEncoding(sequence_length, embed_dim)(x)
# Transformer blocks
for _ in range(num_blocks):
x = TransformerBlock(embed_dim, num_heads, ff_dim, dropout_rate=0.1)(x)
# Global average pooling
x = layers.GlobalAveragePooling1D()(x)
# Dense layers
x = layers.Dense(32, activation='relu')(x)
x = layers.Dropout(0.2)(x)
# Output layer
if output_type == 'regression':
outputs = layers.Dense(1, activation='linear')(x)
model = Model(inputs, outputs)
model.compile(optimizer=Adam(learning_rate=0.001),
loss='mse', metrics=['mae'])
else:
outputs = layers.Dense(1, activation='sigmoid')(x)
model = Model(inputs, outputs)
model.compile(optimizer=Adam(learning_rate=0.001),
loss='binary_crossentropy', metrics=['accuracy'])
return model
# Build and train transformer
transformer_model = build_financial_transformer(
sequence_length=20,
n_features=len(lstm_features),
embed_dim=32,
num_heads=4,
ff_dim=64,
num_blocks=2
)
print("Financial Transformer built")
transformer_model.summary()
# Train transformer model
print("Training Transformer model...")
transformer_history = transformer_model.fit(
X_train_l, y_train_l,
validation_data=(X_val_l, y_val_l),
epochs=50,
batch_size=32,
callbacks=[EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)],
verbose=0
)
# Evaluate
transformer_pred = transformer_model.predict(X_test_lstm, verbose=0)
print(f"\nTransformer Test MSE: {mean_squared_error(y_test_lstm, transformer_pred):.6f}")
print(f"Transformer Test MAE: {mean_absolute_error(y_test_lstm, transformer_pred):.6f}")
print(f"\nLSTM Test MSE: {mean_squared_error(y_test_lstm, test_pred_lstm):.6f}")
print(f"LSTM Test MAE: {mean_absolute_error(y_test_lstm, test_pred_lstm):.6f}")
Section 4: Deep Learning Architecture Design
Designing the right architecture is crucial for financial applications. This section covers best practices and common patterns.
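One practical habit when sizing architectures on limited financial data is to count parameters against the number of training samples. A fully connected layer from n_in to n_out units has (n_in + 1) * n_out weights (the +1 is the bias); a small hypothetical helper (ignoring BatchNormalization parameters) makes the arithmetic explicit:

```python
def dense_param_count(layer_sizes):
    """Trainable parameters in a stack of fully connected layers, biases included."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# e.g. 14 input features -> 64 -> 32 -> 16 -> 1 output
print(dense_param_count([14, 64, 32, 16, 1]))  # 3585 parameters
```

With only a few thousand daily samples, a few thousand parameters is already a generous budget, which is why the dropout and L2 penalties used throughout this module matter.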
# Multi-Input Deep Learning Model
class MultiInputFinanceModel:
"""Deep learning model with multiple input branches."""
def __init__(self, sequence_length=20, n_price_features=5,
n_fundamental_features=10):
self.sequence_length = sequence_length
self.n_price_features = n_price_features
self.n_fundamental_features = n_fundamental_features
self.model = None
def build_model(self):
"""Build multi-input model with LSTM and dense branches."""
# Price time series input (LSTM branch)
price_input = layers.Input(
shape=(self.sequence_length, self.n_price_features),
name='price_input'
)
lstm_out = layers.LSTM(64, return_sequences=True)(price_input)
lstm_out = layers.Dropout(0.2)(lstm_out)
lstm_out = layers.LSTM(32)(lstm_out)
lstm_out = layers.Dropout(0.2)(lstm_out)
lstm_out = layers.Dense(16, activation='relu')(lstm_out)
# Fundamental features input (Dense branch)
fundamental_input = layers.Input(
shape=(self.n_fundamental_features,),
name='fundamental_input'
)
dense_out = layers.Dense(32, activation='relu')(fundamental_input)
dense_out = layers.BatchNormalization()(dense_out)
dense_out = layers.Dropout(0.3)(dense_out)
dense_out = layers.Dense(16, activation='relu')(dense_out)
# Merge branches
merged = layers.Concatenate()([lstm_out, dense_out])
# Final layers
x = layers.Dense(32, activation='relu')(merged)
x = layers.Dropout(0.2)(x)
x = layers.Dense(16, activation='relu')(x)
# Output
output = layers.Dense(1, activation='sigmoid', name='output')(x)
self.model = Model(
inputs=[price_input, fundamental_input],
outputs=output
)
self.model.compile(
optimizer=Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy']
)
return self.model
# Create multi-input model
multi_model = MultiInputFinanceModel(
sequence_length=20,
n_price_features=5,
n_fundamental_features=10
)
multi_model.build_model()
print("Multi-Input Model Architecture:")
multi_model.model.summary()
# Residual Network for Finance
def residual_block(x, units, dropout_rate=0.2):
"""Create a residual block with skip connection."""
# Main path
shortcut = x
x = layers.Dense(units, activation='relu')(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(dropout_rate)(x)
x = layers.Dense(units, activation='relu')(x)
x = layers.BatchNormalization()(x)
# Skip connection
if shortcut.shape[-1] != units:
shortcut = layers.Dense(units)(shortcut)
x = layers.Add()([x, shortcut])
x = layers.Activation('relu')(x)
x = layers.Dropout(dropout_rate)(x)
return x
def build_resnet_finance(input_dim, hidden_units=[64, 32, 16]):
"""Build a ResNet-style model for financial data."""
inputs = layers.Input(shape=(input_dim,))
x = layers.Dense(hidden_units[0], activation='relu')(inputs)
x = layers.BatchNormalization()(x)
for units in hidden_units:
x = residual_block(x, units)
x = layers.Dense(16, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = Model(inputs, outputs)
model.compile(
optimizer=Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy']
)
return model
# Build ResNet model
resnet_model = build_resnet_finance(len(feature_cols), hidden_units=[64, 32, 32, 16])
print("ResNet-style Financial Model:")
resnet_model.summary()
# Compare architectures
print("Training ResNet model for comparison...")
resnet_history = resnet_model.fit(
X_train_scaled, y_train_nn,
validation_data=(X_val_scaled, y_val),
epochs=50,
batch_size=32,
callbacks=[EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)],
verbose=0
)
# Evaluate
resnet_pred = (resnet_model.predict(X_test_scaled, verbose=0) >= 0.5).astype(int)
print("\n" + "="*50)
print("Architecture Comparison (Test Set):")
print("="*50)
print(f"Simple NN Accuracy: {accuracy_score(y_test, test_pred):.4f}")
print(f"ResNet Accuracy: {accuracy_score(y_test, resnet_pred):.4f}")
Section 5: Regularization and Optimization Techniques
Financial data is noisy and prone to overfitting. Proper regularization is essential.
# Custom Financial Loss Functions
def directional_loss(y_true, y_pred):
"""Loss that penalizes wrong direction more than magnitude."""
direction_true = tf.sign(y_true)
direction_pred = tf.sign(y_pred)
# MSE component
mse_loss = tf.reduce_mean(tf.square(y_true - y_pred))
# Direction penalty
direction_penalty = tf.reduce_mean(
tf.cast(direction_true != direction_pred, tf.float32)
)
return mse_loss + 0.5 * direction_penalty
def sharpe_loss(y_true, y_pred):
"""Loss based on Sharpe ratio approximation."""
# Predicted returns (position * actual return)
pred_returns = y_pred * y_true
mean_return = tf.reduce_mean(pred_returns)
std_return = tf.math.reduce_std(pred_returns) + 1e-8
# Negative Sharpe (to minimize)
sharpe = mean_return / std_return
return -sharpe
print("Custom loss functions defined")
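To sanity-check the intuition behind `directional_loss`, here is a NumPy mirror of the same formula on toy returns (a hypothetical standalone check, not part of the TensorFlow training loop):

```python
import numpy as np

def directional_loss_np(y_true, y_pred):
    """NumPy mirror of directional_loss: MSE plus a penalty for wrong sign."""
    mse = np.mean((y_true - y_pred) ** 2)
    wrong_direction = np.mean(np.sign(y_true) != np.sign(y_pred))
    return mse + 0.5 * wrong_direction

y_true = np.array([0.010, -0.020, 0.030])   # actual next-day returns
close = np.array([0.012, -0.015, 0.025])    # right directions, close magnitudes
wrong = np.array([0.012, 0.015, 0.025])     # same magnitudes, one sign flipped
print(f"right directions: {directional_loss_np(y_true, close):.4f}")
print(f"one wrong sign:   {directional_loss_np(y_true, wrong):.4f}")
```

Because daily returns are small, the MSE term is tiny; the direction penalty dominates, which is exactly the behavior the custom loss is designed to have.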
# Advanced Regularization Techniques
def build_regularized_model(input_dim, l1_reg=0.001, l2_reg=0.01):
"""Build a heavily regularized model for noisy financial data."""
model = Sequential([
# Input layer with L1/L2 regularization
layers.Dense(
64,
input_dim=input_dim,
activation='relu',
kernel_regularizer=keras.regularizers.l1_l2(l1=l1_reg, l2=l2_reg),
activity_regularizer=keras.regularizers.l2(l2_reg)
),
layers.BatchNormalization(),
layers.Dropout(0.4),
# Hidden layers
layers.Dense(
32,
activation='relu',
kernel_regularizer=keras.regularizers.l2(l2_reg)
),
layers.BatchNormalization(),
layers.Dropout(0.3),
layers.Dense(
16,
activation='relu',
kernel_regularizer=keras.regularizers.l2(l2_reg)
),
layers.Dropout(0.2),
# Output
layers.Dense(1, activation='sigmoid')
])
model.compile(
optimizer=Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy']
)
return model
# Build and train regularized model
reg_model = build_regularized_model(len(feature_cols))
print("Training regularized model...")
reg_history = reg_model.fit(
X_train_scaled, y_train_nn,
validation_data=(X_val_scaled, y_val),
epochs=50,
batch_size=32,
callbacks=[EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)],
verbose=0
)
# Compare overfitting
print("\nRegularization Effect:")
print(f"Train-Val accuracy gap (Original): {history.history['accuracy'][-1] - history.history['val_accuracy'][-1]:.4f}")
print(f"Train-Val accuracy gap (Regularized): {reg_history.history['accuracy'][-1] - reg_history.history['val_accuracy'][-1]:.4f}")
# Learning Rate Scheduling
def get_lr_schedule(initial_lr=0.001, decay_steps=1000, decay_rate=0.9):
"""Create exponential decay learning rate schedule."""
return keras.optimizers.schedules.ExponentialDecay(
initial_learning_rate=initial_lr,
decay_steps=decay_steps,
decay_rate=decay_rate,
staircase=True
)
def get_warmup_schedule(initial_lr=0.0001, target_lr=0.001, warmup_steps=100):
"""Create learning rate schedule with warmup."""
class WarmupSchedule(keras.optimizers.schedules.LearningRateSchedule):
def __init__(self, initial_lr, target_lr, warmup_steps):
self.initial_lr = initial_lr
self.target_lr = target_lr
self.warmup_steps = warmup_steps
def __call__(self, step):
step = tf.cast(step, tf.float32)
warmup_factor = tf.minimum(step / self.warmup_steps, 1.0)
return self.initial_lr + warmup_factor * (self.target_lr - self.initial_lr)
return WarmupSchedule(initial_lr, target_lr, warmup_steps)
print("Learning rate schedules defined")
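The warmup schedule can be checked in plain Python (the same formula as `WarmupSchedule` above, minus the TensorFlow dependency):

```python
def warmup_lr(step, initial_lr=1e-4, target_lr=1e-3, warmup_steps=100):
    """Linear warmup from initial_lr to target_lr, then hold at target_lr."""
    factor = min(step / warmup_steps, 1.0)
    return initial_lr + factor * (target_lr - initial_lr)

for step in [0, 50, 100, 500]:
    print(f"step {step:>3}: lr = {warmup_lr(step):.2e}")
```

The rate climbs linearly for the first `warmup_steps` updates and then stays at `target_lr`; warmup like this tends to stabilize the early epochs of Transformer training in particular.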
Section 6: Module Project - Deep Learning Trading System
Build a complete deep learning trading system that combines multiple architectures.
# Complete Deep Learning Trading System
class DeepLearningTradingSystem:
"""Production-ready deep learning trading system."""
def __init__(self, sequence_length=20):
self.sequence_length = sequence_length
self.feature_scaler = StandardScaler()
self.models = {}
self.histories = {}
def create_features(self, df):
"""Create comprehensive feature set."""
data = df.copy()
# Returns
for period in [1, 5, 10, 20]:
data[f'return_{period}d'] = data['close'].pct_change(period)
# Volatility
for period in [5, 10, 20]:
data[f'volatility_{period}d'] = data['return_1d'].rolling(period).std()
# Technical indicators
for period in [5, 10, 20, 50]:
sma = data['close'].rolling(period).mean()
data[f'price_to_sma_{period}'] = data['close'] / sma
# RSI
delta = data['close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
data['rsi'] = 100 - (100 / (1 + gain / (loss + 1e-10)))
# MACD
exp12 = data['close'].ewm(span=12).mean()
exp26 = data['close'].ewm(span=26).mean()
data['macd'] = exp12 - exp26
data['macd_signal'] = data['macd'].ewm(span=9).mean()
# Volume
data['volume_ratio'] = data['volume'] / data['volume'].rolling(20).mean()
# Target
data['target'] = (data['close'].shift(-1) > data['close']).astype(int)
return data.dropna()
def build_ensemble(self, n_features):
"""Build ensemble of different architectures."""
# 1. Feedforward Network
ff_model = Sequential([
layers.Dense(64, input_dim=n_features, activation='relu',
kernel_regularizer=keras.regularizers.l2(0.01)),
layers.BatchNormalization(),
layers.Dropout(0.3),
layers.Dense(32, activation='relu'),
layers.Dropout(0.2),
layers.Dense(16, activation='relu'),
layers.Dense(1, activation='sigmoid')
])
ff_model.compile(optimizer=Adam(0.001),
loss='binary_crossentropy', metrics=['accuracy'])
self.models['feedforward'] = ff_model
# 2. ResNet-style
self.models['resnet'] = build_resnet_finance(n_features)
print(f"Built ensemble with {len(self.models)} models")
def build_lstm_model(self, n_features):
"""Build LSTM model for sequence data."""
lstm_model = Sequential([
layers.LSTM(64, return_sequences=True,
input_shape=(self.sequence_length, n_features)),
layers.Dropout(0.2),
layers.LSTM(32),
layers.Dropout(0.2),
layers.Dense(16, activation='relu'),
layers.Dense(1, activation='sigmoid')
])
lstm_model.compile(optimizer=Adam(0.001),
loss='binary_crossentropy', metrics=['accuracy'])
self.models['lstm'] = lstm_model
def train_models(self, X_train, y_train, X_val, y_val, epochs=50):
"""Train all models in the ensemble."""
callbacks = [
EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
]
for name, model in self.models.items():
print(f"\nTraining {name}...")
history = model.fit(
X_train, y_train,
validation_data=(X_val, y_val),
epochs=epochs,
batch_size=32,
callbacks=callbacks,
verbose=0
)
self.histories[name] = history
print(f" Val accuracy: {history.history['val_accuracy'][-1]:.4f}")
def predict_ensemble(self, X, weights=None):
"""Make ensemble predictions."""
if weights is None:
weights = {name: 1/len(self.models) for name in self.models}
predictions = np.zeros((len(X), 1))
for name, model in self.models.items():
pred = model.predict(X, verbose=0)
predictions += weights[name] * pred
return predictions
def generate_signals(self, X, threshold=0.5):
"""Generate trading signals from ensemble."""
probs = self.predict_ensemble(X)
signals = np.where(probs >= threshold, 1, -1)
return signals.flatten(), probs.flatten()
def backtest(self, signals, returns):
"""Simple backtest of signals."""
strategy_returns = signals * returns
cumulative = (1 + strategy_returns).cumprod()
# Metrics
total_return = cumulative.iloc[-1] - 1
sharpe = np.sqrt(252) * strategy_returns.mean() / (strategy_returns.std() + 1e-8)
max_dd = (cumulative / cumulative.cummax() - 1).min()
win_rate = (strategy_returns > 0).mean()
return {
'total_return': total_return,
'sharpe_ratio': sharpe,
'max_drawdown': max_dd,
'win_rate': win_rate,
'cumulative': cumulative
}
print("DeepLearningTradingSystem class defined")
# Run the complete system
system = DeepLearningTradingSystem(sequence_length=20)
# Create features
df_system = system.create_features(df)
# Select features
system_features = ['return_1d', 'return_5d', 'return_10d', 'return_20d',
'volatility_5d', 'volatility_10d', 'volatility_20d',
'price_to_sma_5', 'price_to_sma_10', 'price_to_sma_20',
'rsi', 'macd', 'volume_ratio']
X_sys = df_system[system_features].values
y_sys = df_system['target'].values
returns = df_system['return_1d'].values
# Time-based splits
split_idx = int(len(X_sys) * 0.8)
X_train_sys, X_test_sys = X_sys[:split_idx], X_sys[split_idx:]
y_train_sys, y_test_sys = y_sys[:split_idx], y_sys[split_idx:]
returns_test = returns[split_idx:]
val_idx = int(len(X_train_sys) * 0.8)
X_train_s, X_val_s = X_train_sys[:val_idx], X_train_sys[val_idx:]
y_train_s, y_val_s = y_train_sys[:val_idx], y_train_sys[val_idx:]
# Scale features
X_train_scaled_s = system.feature_scaler.fit_transform(X_train_s)
X_val_scaled_s = system.feature_scaler.transform(X_val_s)
X_test_scaled_s = system.feature_scaler.transform(X_test_sys)
# Build and train ensemble
system.build_ensemble(len(system_features))
system.train_models(X_train_scaled_s, y_train_s, X_val_scaled_s, y_val_s, epochs=30)
# Generate signals and backtest
signals, probs = system.generate_signals(X_test_scaled_s)
# Convert to pandas for backtest
returns_series = pd.Series(returns_test,
index=df_system.index[split_idx:])
signals_series = pd.Series(signals,
index=df_system.index[split_idx:])
# Run backtest
results = system.backtest(signals_series, returns_series)
print("\n" + "="*50)
print("Deep Learning Trading System Results")
print("="*50)
print(f"Total Return: {results['total_return']:.2%}")
print(f"Sharpe Ratio: {results['sharpe_ratio']:.2f}")
print(f"Max Drawdown: {results['max_drawdown']:.2%}")
print(f"Win Rate: {results['win_rate']:.2%}")
# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Cumulative returns
buy_hold = (1 + returns_series).cumprod()
axes[0, 0].plot(results['cumulative'].index, results['cumulative'].values,
label='Strategy', linewidth=2)
axes[0, 0].plot(buy_hold.index, buy_hold.values,
label='Buy & Hold', linewidth=2, alpha=0.7)
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Cumulative Return')
axes[0, 0].set_title('Strategy vs Buy & Hold')
axes[0, 0].legend()
# Prediction probabilities distribution
axes[0, 1].hist(probs, bins=50, edgecolor='black', alpha=0.7)
axes[0, 1].axvline(x=0.5, color='red', linestyle='--', label='Threshold')
axes[0, 1].set_xlabel('Prediction Probability')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Ensemble Prediction Distribution')
axes[0, 1].legend()
# Training history comparison
for name, history in system.histories.items():
axes[1, 0].plot(history.history['val_accuracy'], label=name)
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Validation Accuracy')
axes[1, 0].set_title('Model Training Comparison')
axes[1, 0].legend()
# Signal distribution over time
signal_ma = pd.Series(signals).rolling(20).mean()
axes[1, 1].plot(signal_ma.values)
axes[1, 1].axhline(y=0, color='red', linestyle='--')
axes[1, 1].set_xlabel('Time')
axes[1, 1].set_ylabel('Signal (20-day MA)')
axes[1, 1].set_title('Trading Signal Trend')
plt.tight_layout()
plt.show()
Exercises
Complete the following exercises to practice deep learning for finance.
Exercise 11.1: Build Custom Neural Network (Guided)
Complete the neural network architecture with proper layers.
Solution 11.1
def build_custom_classifier(input_dim, hidden_units=[128, 64, 32]):
model = Sequential()
# Add first Dense layer with input_dim
model.add(layers.Dense(hidden_units[0], input_dim=input_dim, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.3))
# Add remaining hidden layers
for units in hidden_units[1:]:
model.add(layers.Dense(units, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.3))
# Add output layer
model.add(layers.Dense(1, activation='sigmoid'))
# Compile model
model.compile(optimizer=Adam(0.001), loss='binary_crossentropy', metrics=['accuracy'])
return model
Exercise 11.2: Implement LSTM Sequence Model (Guided)
Build an LSTM model for sequence prediction.
Solution 11.2
def build_lstm_classifier(sequence_length, n_features, lstm_units=[64, 32]):
model = Sequential()
# Add first LSTM layer (return sequences for stacking)
model.add(layers.LSTM(
lstm_units[0],
return_sequences=True,
input_shape=(sequence_length, n_features)
))
model.add(layers.Dropout(0.2))
# Add second LSTM layer (no return sequences)
model.add(layers.LSTM(lstm_units[1], return_sequences=False))
model.add(layers.Dropout(0.2))
# Add Dense layers and output
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer=Adam(0.001), loss='binary_crossentropy', metrics=['accuracy'])
return model
Exercise 11.3: Create Training Pipeline (Guided)
Implement a training pipeline with proper callbacks.
Solution 11.3
def train_with_callbacks(model, X_train, y_train, X_val, y_val,
epochs=100, batch_size=32):
# Create callbacks list
callbacks = [
EarlyStopping(
monitor='val_loss',
patience=10,
restore_best_weights=True
),
ReduceLROnPlateau(
monitor='val_loss',
factor=0.5,
patience=5,
min_lr=1e-6
)
]
# Train model
history = model.fit(
X_train, y_train,
validation_data=(X_val, y_val),
epochs=epochs,
batch_size=batch_size,
callbacks=callbacks,
verbose=1
)
return history
Exercise 11.4: Build a Bidirectional LSTM (Open-ended)
Create a Bidirectional LSTM model that can learn patterns from both past and future context in the sequence.
Solution 11.4
def build_bidirectional_lstm(sequence_length, n_features, lstm_units=[64, 32]):
model = Sequential([
layers.Bidirectional(
layers.LSTM(lstm_units[0], return_sequences=True),
input_shape=(sequence_length, n_features)
),
layers.Dropout(0.2),
layers.Bidirectional(
layers.LSTM(lstm_units[1], return_sequences=False)
),
layers.Dropout(0.2),
layers.Dense(32, activation='relu'),
layers.BatchNormalization(),
layers.Dropout(0.2),
layers.Dense(16, activation='relu'),
layers.Dense(1, activation='sigmoid')
])
model.compile(
optimizer=Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy']
)
return model
# Build and test
bilstm = build_bidirectional_lstm(20, 5)
bilstm.summary()
Exercise 11.5: Implement Attention Mechanism (Open-ended)
Add a custom attention layer to an LSTM model to focus on important time steps.
Solution 11.5
def build_lstm_attention(sequence_length, n_features, lstm_units=64):
inputs = layers.Input(shape=(sequence_length, n_features))
# LSTM layer with sequence output
lstm_out = layers.LSTM(lstm_units, return_sequences=True)(inputs)
lstm_out = layers.Dropout(0.2)(lstm_out)
# Apply attention
attention_out = AttentionLayer()(lstm_out)
# Dense layers
x = layers.Dense(32, activation='relu')(attention_out)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(16, activation='relu')(x)
# Output
outputs = layers.Dense(1, activation='sigmoid')(x)
model = Model(inputs, outputs)
model.compile(
optimizer=Adam(0.001),
loss='binary_crossentropy',
metrics=['accuracy']
)
return model
# Build and test
attention_model = build_lstm_attention(20, 5)
attention_model.summary()
Exercise 11.6: Create Model Ensemble with Weighted Voting (Open-ended)
Build an ensemble of different deep learning architectures with learned weights.
Solution 11.6
class WeightedDeepEnsemble:
def __init__(self):
self.models = {}
self.weights = {}
self.scaler = StandardScaler()
def build_models(self, input_dim, sequence_length=None, n_features=None):
# Feedforward
ff = Sequential([
layers.Dense(64, input_dim=input_dim, activation='relu'),
layers.BatchNormalization(),
layers.Dropout(0.3),
layers.Dense(32, activation='relu'),
layers.Dense(1, activation='sigmoid')
])
ff.compile(optimizer=Adam(0.001), loss='binary_crossentropy', metrics=['accuracy'])
self.models['feedforward'] = ff
# CNN-1D (if sequence data)
if sequence_length and n_features:
cnn = Sequential([
layers.Conv1D(32, 3, activation='relu',
input_shape=(sequence_length, n_features)),
layers.MaxPooling1D(2),
layers.Conv1D(64, 3, activation='relu'),
layers.GlobalMaxPooling1D(),
layers.Dense(32, activation='relu'),
layers.Dense(1, activation='sigmoid')
])
cnn.compile(optimizer=Adam(0.001), loss='binary_crossentropy', metrics=['accuracy'])
self.models['cnn'] = cnn
def train_and_weight(self, X_train, y_train, X_val, y_val, epochs=30):
val_accuracies = {}
for name, model in self.models.items():
history = model.fit(
X_train, y_train,
validation_data=(X_val, y_val),
epochs=epochs,
batch_size=32,
callbacks=[EarlyStopping(patience=5, restore_best_weights=True)],
verbose=0
)
val_accuracies[name] = max(history.history['val_accuracy'])
# Calculate weights based on validation accuracy
total = sum(val_accuracies.values())
self.weights = {name: acc/total for name, acc in val_accuracies.items()}
print(f"Learned weights: {self.weights}")
def predict(self, X):
predictions = np.zeros((len(X), 1))
for name, model in self.models.items():
predictions += self.weights[name] * model.predict(X, verbose=0)
return predictions
# Usage
ensemble = WeightedDeepEnsemble()
ensemble.build_models(input_dim=14)
Summary
In this module, you learned:
- Neural Network Fundamentals: Building feedforward networks with proper regularization for financial data
- LSTM Networks: Implementing sequence models that capture temporal dependencies in price data
- Transformers and Attention: Using attention mechanisms to identify important time steps
- Architecture Design: Creating multi-input models and residual connections
- Regularization Techniques: Preventing overfitting with dropout, batch normalization, and L1/L2 regularization
- Production Systems: Building complete trading systems with deep learning ensembles
Key Takeaways
- Financial data requires heavy regularization due to low signal-to-noise ratio
- LSTM and Transformers capture different types of temporal patterns
- Ensemble methods combine strengths of multiple architectures
- Proper data preprocessing (scaling, sequencing) is critical for deep learning
- Always use validation sets and early stopping to prevent overfitting
Next Steps
In Module 12, you'll learn about backtesting ML strategies properly, including walk-forward optimization and avoiding common pitfalls like look-ahead bias.
Module 12: Backtesting ML Strategies
Overview
Backtesting ML trading strategies requires special care to avoid common pitfalls like look-ahead bias and overfitting. This module covers proper backtesting methodology, walk-forward optimization, and realistic performance evaluation.
Learning Objectives
By the end of this module, you will be able to:
- Implement proper walk-forward validation for ML strategies
- Identify and avoid common backtesting pitfalls
- Build realistic backtesting frameworks with transaction costs
- Apply robust performance evaluation techniques
Prerequisites
- Module 7: Model Evaluation
- Module 11: Deep Learning for Finance
- Understanding of time series cross-validation
Estimated Time: 3 hours
Section 1: Walk-Forward Optimization
Walk-forward optimization is the gold standard for backtesting ML strategies, simulating real-world conditions where models are periodically retrained.
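At its core, walk-forward validation is just index bookkeeping: a fixed-length training window slides forward through time, and the model only ever predicts the block immediately after it. A dependency-free sketch of the split generator (illustrative; the full `WalkForwardOptimizer` below adds scaling, refitting, and per-fold metrics):

```python
def walk_forward_splits(n_samples, train_window=252, test_window=21, step_size=21):
    """Yield (train_start, train_end, test_end) index triples; the test block
    always lies strictly after the training block, so no future data is used."""
    start = train_window
    while start + test_window <= n_samples:
        yield start - train_window, start, start + test_window
        start += step_size

# For 300 samples: two folds, each training on the 252 days before its test month
for tr_start, tr_end, te_end in walk_forward_splits(300):
    print(f"train [{tr_start}:{tr_end}) -> test [{tr_end}:{te_end})")
```

Because the training window is fixed-length rather than expanding, each fold sees the same amount of history, which keeps fold accuracies comparable across regimes.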
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
print("Libraries loaded for backtesting")
# Generate realistic financial data
def generate_backtest_data(n_samples=3000):
"""Generate synthetic data with regime changes."""
np.random.seed(42)
dates = pd.date_range(start='2015-01-01', periods=n_samples, freq='D')
# Create regime-switching returns
regime = np.zeros(n_samples)
current_regime = 0
for i in range(n_samples):
if np.random.random() < 0.01: # 1% chance to switch regime
current_regime = 1 - current_regime
regime[i] = current_regime
# Generate returns based on regime
returns = np.where(
regime == 0,
np.random.normal(0.0005, 0.012, n_samples), # Bull regime
np.random.normal(-0.0002, 0.018, n_samples) # Bear regime
)
# Generate prices
prices = 100 * np.exp(np.cumsum(returns))
# Create OHLCV
df = pd.DataFrame({
'date': dates,
'open': np.roll(prices, 1),
'high': prices * (1 + np.abs(np.random.normal(0, 0.008, n_samples))),
'low': prices * (1 - np.abs(np.random.normal(0, 0.008, n_samples))),
'close': prices,
'volume': np.random.lognormal(15, 0.5, n_samples),
'regime': regime
})
df.loc[0, 'open'] = df.loc[0, 'close']
df.set_index('date', inplace=True)
return df
# Generate data
df = generate_backtest_data(3000)
print(f"Dataset: {df.index[0].date()} to {df.index[-1].date()}")
print(f"Samples: {len(df)}")
df.tail()
# Feature engineering
def create_features(df):
"""Create features for ML model."""
data = df.copy()
# Returns
for period in [1, 5, 10, 20]:
data[f'return_{period}d'] = data['close'].pct_change(period)
# Volatility
for period in [5, 10, 20]:
data[f'volatility_{period}d'] = data['return_1d'].rolling(period).std()
# Moving averages
for period in [5, 10, 20, 50]:
data[f'sma_{period}'] = data['close'].rolling(period).mean()
data[f'price_to_sma_{period}'] = data['close'] / data[f'sma_{period}']
# RSI
delta = data['close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
data['rsi'] = 100 - (100 / (1 + gain / (loss + 1e-10)))
# MACD
exp12 = data['close'].ewm(span=12).mean()
exp26 = data['close'].ewm(span=26).mean()
data['macd'] = exp12 - exp26
data['macd_signal'] = data['macd'].ewm(span=9).mean()
# Volume
data['volume_sma'] = data['volume'].rolling(20).mean()
data['volume_ratio'] = data['volume'] / data['volume_sma']
# Target: next day direction
data['target'] = (data['close'].shift(-1) > data['close']).astype(int)
data['future_return'] = data['close'].pct_change().shift(-1)
return data.dropna()
# Create features
df_features = create_features(df)
print(f"Features created: {len(df_features)} samples")
# Walk-Forward Optimizer
class WalkForwardOptimizer:
"""Walk-forward optimization framework for ML strategies."""
def __init__(self, model, train_window=252, test_window=21,
step_size=21, min_train_samples=100):
"""
Args:
model: sklearn-compatible model
train_window: Number of days for training (1 year = 252)
test_window: Number of days for testing (1 month = 21)
step_size: How often to retrain (monthly = 21)
min_train_samples: Minimum samples needed for training
"""
self.model = model
self.train_window = train_window
self.test_window = test_window
self.step_size = step_size
self.min_train_samples = min_train_samples
self.scaler = StandardScaler()
self.results = []
def run(self, df, feature_cols, target_col='target'):
"""Run walk-forward optimization."""
X = df[feature_cols].values
y = df[target_col].values
dates = df.index
n_samples = len(X)
predictions = np.full(n_samples, np.nan)
probabilities = np.full(n_samples, np.nan)
# Walk-forward loop
start_idx = self.train_window
fold = 0
while start_idx + self.test_window <= n_samples:
# Define train and test indices
train_start = max(0, start_idx - self.train_window)
train_end = start_idx
test_start = start_idx
test_end = min(start_idx + self.test_window, n_samples)
# Get train and test data
X_train = X[train_start:train_end]
y_train = y[train_start:train_end]
X_test = X[test_start:test_end]
y_test = y[test_start:test_end]
# Skip if insufficient training data
if len(X_train) < self.min_train_samples:
start_idx += self.step_size
continue
# Scale features
X_train_scaled = self.scaler.fit_transform(X_train)
X_test_scaled = self.scaler.transform(X_test)
# Train model
self.model.fit(X_train_scaled, y_train)
# Predict
pred = self.model.predict(X_test_scaled)
prob = self.model.predict_proba(X_test_scaled)[:, 1]
# Store predictions
predictions[test_start:test_end] = pred
probabilities[test_start:test_end] = prob
# Record fold results
fold_accuracy = accuracy_score(y_test, pred)
self.results.append({
'fold': fold,
'train_start': dates[train_start],
'train_end': dates[train_end-1],
'test_start': dates[test_start],
'test_end': dates[test_end-1],
'accuracy': fold_accuracy,
'n_train': len(X_train),
'n_test': len(X_test)
})
fold += 1
start_idx += self.step_size
# Create results dataframe
df_results = df.copy()
df_results['prediction'] = predictions
df_results['probability'] = probabilities
# Keep signal NaN where no prediction was made (warm-up period before the first fold)
df_results['signal'] = np.where(np.isnan(predictions), np.nan, np.where(predictions == 1, 1, -1))
return df_results
def get_fold_summary(self):
"""Get summary of all folds."""
return pd.DataFrame(self.results)
print("WalkForwardOptimizer class defined")
# Run walk-forward optimization
feature_cols = ['return_1d', 'return_5d', 'return_10d', 'return_20d',
'volatility_5d', 'volatility_10d', 'volatility_20d',
'price_to_sma_5', 'price_to_sma_10', 'price_to_sma_20',
'rsi', 'macd', 'volume_ratio']
# Initialize optimizer with Random Forest
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
wfo = WalkForwardOptimizer(
model=model,
train_window=252, # 1 year training
test_window=21, # 1 month testing
step_size=21 # Retrain monthly
)
# Run walk-forward
results = wfo.run(df_features, feature_cols)
# Get fold summary
fold_summary = wfo.get_fold_summary()
print(f"\nWalk-forward completed: {len(fold_summary)} folds")
print(f"Average accuracy: {fold_summary['accuracy'].mean():.4f}")
print(f"Accuracy std: {fold_summary['accuracy'].std():.4f}")
# Visualize walk-forward results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Accuracy by fold
axes[0, 0].bar(fold_summary['fold'], fold_summary['accuracy'])
axes[0, 0].axhline(y=0.5, color='red', linestyle='--', label='Random')
axes[0, 0].axhline(y=fold_summary['accuracy'].mean(), color='green',
linestyle='--', label='Average')
axes[0, 0].set_xlabel('Fold')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].set_title('Accuracy by Fold')
axes[0, 0].legend()
# Rolling accuracy over time
test_mask = ~results['prediction'].isna()
rolling_acc = (results.loc[test_mask, 'prediction'] ==
results.loc[test_mask, 'target']).rolling(63).mean()
axes[0, 1].plot(rolling_acc.index, rolling_acc.values)
axes[0, 1].axhline(y=0.5, color='red', linestyle='--')
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Rolling Accuracy (63-day)')
axes[0, 1].set_title('Accuracy Over Time')
# Prediction probability distribution
valid_probs = results['probability'].dropna()
axes[1, 0].hist(valid_probs, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].axvline(x=0.5, color='red', linestyle='--')
axes[1, 0].set_xlabel('Prediction Probability')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Prediction Distribution')
# Training window visualization
axes[1, 1].scatter(fold_summary['fold'], fold_summary['n_train'],
label='Train samples', alpha=0.7)
axes[1, 1].scatter(fold_summary['fold'], fold_summary['n_test'],
label='Test samples', alpha=0.7)
axes[1, 1].set_xlabel('Fold')
axes[1, 1].set_ylabel('Number of Samples')
axes[1, 1].set_title('Sample Sizes by Fold')
axes[1, 1].legend()
plt.tight_layout()
plt.show()
Section 2: Avoiding Backtesting Pitfalls
Common backtesting mistakes can make a losing strategy appear profitable.
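One cheap smoke test before running any backtest is to correlate each feature with the target it is supposed to predict: a leaked feature typically stands out by an order of magnitude. A minimal sketch (pandas/NumPy, with a deliberately leaked column for illustration; the 0.5 cutoff is an arbitrary illustrative threshold):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))
features = pd.DataFrame({
    'return_1d': close.pct_change(),        # legitimate: uses past info only
    'leaked': close.shift(-1) / close - 1,  # leaked: this IS tomorrow's return
})
target = (close.shift(-1) > close).astype(int)

# |corr| with the target: past-only features hover near 0, leaks near 1
corr = features.corrwith(target).abs().sort_values(ascending=False)
print(corr)
suspicious = corr[corr > 0.5]
print("Possible leaks:", list(suspicious.index))
```

Any feature clearing such a threshold on next-day direction deserves a hard look at how it was constructed before it goes anywhere near a model.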
# Demonstrating Look-Ahead Bias
def demonstrate_lookahead_bias(df, feature_cols):
"""Show the impact of look-ahead bias."""
# WRONG: Using future information in features
df_wrong = df.copy()
# This uses the next day's close to create today's feature!
df_wrong['future_leak'] = df_wrong['close'].shift(-1) / df_wrong['close'] - 1
# Split data
split_idx = int(len(df_wrong) * 0.7)
# Train with leaked feature
X_train = df_wrong[feature_cols + ['future_leak']].iloc[:split_idx].values
X_test = df_wrong[feature_cols + ['future_leak']].iloc[split_idx:].values
y_train = df_wrong['target'].iloc[:split_idx].values
y_test = df_wrong['target'].iloc[split_idx:].values
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test[:-1]) # last test row has NaN future_leak (from shift(-1))
y_test = y_test[:-1]
model_biased = RandomForestClassifier(n_estimators=100, random_state=42)
model_biased.fit(X_train_scaled, y_train)
biased_accuracy = accuracy_score(y_test, model_biased.predict(X_test_scaled))
# Train without leaked feature (correct approach)
X_train_correct = df[feature_cols].iloc[:split_idx].values
X_test_correct = df[feature_cols].iloc[split_idx:].values
y_train_correct = df['target'].iloc[:split_idx].values
y_test_correct = df['target'].iloc[split_idx:].values
scaler2 = StandardScaler()
X_train_scaled2 = scaler2.fit_transform(X_train_correct)
X_test_scaled2 = scaler2.transform(X_test_correct)
model_correct = RandomForestClassifier(n_estimators=100, random_state=42)
model_correct.fit(X_train_scaled2, y_train_correct)
correct_accuracy = accuracy_score(y_test_correct, model_correct.predict(X_test_scaled2))
return biased_accuracy, correct_accuracy
biased_acc, correct_acc = demonstrate_lookahead_bias(df_features, feature_cols)
print("Look-Ahead Bias Demonstration:")
print(f"Accuracy WITH look-ahead bias: {biased_acc:.4f} (unrealistically high!)")
print(f"Accuracy WITHOUT look-ahead bias: {correct_acc:.4f} (realistic)")
# Survivorship Bias Simulator
def simulate_survivorship_bias(n_stocks=100, n_periods=252, survival_rate=0.8):
"""Simulate the impact of survivorship bias."""
np.random.seed(42)
# Generate returns for all stocks
all_returns = np.random.normal(0.0003, 0.02, (n_stocks, n_periods))
# Some stocks will "die" (go bankrupt)
# Dying stocks tend to have worse returns before death
n_deaths = int(n_stocks * (1 - survival_rate))
death_indices = np.random.choice(n_stocks, n_deaths, replace=False)
# Make dying stocks have negative returns
for idx in death_indices:
death_period = np.random.randint(n_periods // 2, n_periods)
all_returns[idx, :death_period] = np.random.normal(-0.002, 0.03, death_period)
all_returns[idx, death_period:] = np.nan # Dead after this
# Calculate true average (including dead stocks)
true_avg_return = np.nanmean(all_returns)
# Calculate survivorship-biased average (only surviving stocks)
survivor_mask = ~np.isnan(all_returns[:, -1]) # Stocks that survived to end
biased_avg_return = np.mean(all_returns[survivor_mask])
return true_avg_return, biased_avg_return, death_indices
true_ret, biased_ret, deaths = simulate_survivorship_bias()
print("\nSurvivorship Bias Simulation:")
print(f"True average daily return: {true_ret:.4%}")
print(f"Biased average daily return: {biased_ret:.4%}")
print(f"Annualized difference: {(biased_ret - true_ret) * 252:.2%}")
# Overfitting Detection
def detect_overfitting(df, feature_cols, max_depth_range=range(2, 20)):
"""Detect overfitting by comparing train/test performance."""
X = df[feature_cols].values
y = df['target'].values
# Time-based split
split_idx = int(len(X) * 0.7)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
results = []
for depth in max_depth_range:
model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
model.fit(X_train_scaled, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train_scaled))
test_acc = accuracy_score(y_test, model.predict(X_test_scaled))
results.append({
'max_depth': depth,
'train_accuracy': train_acc,
'test_accuracy': test_acc,
'overfit_gap': train_acc - test_acc
})
return pd.DataFrame(results)
overfit_results = detect_overfitting(df_features, feature_cols)
print("\nOverfitting Analysis:")
print(overfit_results.to_string(index=False))
# Visualize overfitting
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Train vs Test accuracy
axes[0].plot(overfit_results['max_depth'], overfit_results['train_accuracy'],
'b-', label='Training', marker='o')
axes[0].plot(overfit_results['max_depth'], overfit_results['test_accuracy'],
'r-', label='Test', marker='o')
axes[0].set_xlabel('Max Depth')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Model Complexity vs Performance')
axes[0].legend()
# Overfit gap
axes[1].bar(overfit_results['max_depth'], overfit_results['overfit_gap'])
axes[1].axhline(y=0.05, color='red', linestyle='--', label='Warning threshold')
axes[1].set_xlabel('Max Depth')
axes[1].set_ylabel('Train - Test Gap')
axes[1].set_title('Overfitting Gap')
axes[1].legend()
plt.tight_layout()
plt.show()
# Recommend optimal depth
best_depth = overfit_results.loc[overfit_results['test_accuracy'].idxmax(), 'max_depth']
print(f"\nRecommended max_depth: {best_depth}")
Section 3: Realistic Backtesting Framework
A proper backtest must account for transaction costs, slippage, and realistic execution.
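Before building the framework, it is worth seeing how quickly round-trip costs compound. With commission c and slippage s per unit of turnover, each flip from long to short turns over roughly 2x capital and so costs about 2(c+s). A back-of-the-envelope sketch (pure Python, using the same illustrative cost figures as the backtester below):

```python
def annual_cost_drag(trades_per_year, commission=0.001, slippage=0.0005, turnover_per_trade=2.0):
    """Approximate fraction of capital lost to costs per year.
    turnover_per_trade=2.0 models a full flip from long to short."""
    per_trade = turnover_per_trade * (commission + slippage)
    return 1 - (1 - per_trade) ** trades_per_year

# Monthly, weekly, and daily flipping under 0.1% commission + 0.05% slippage
for trades in (12, 52, 252):
    print(f"{trades:>3} trades/year -> {annual_cost_drag(trades):.1%} annual cost drag")
```

Daily flipping at these cost levels gives up more than half the capital per year before the strategy earns anything, which is why the signal-change check in `run_backtest` below only charges costs when the position actually changes.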
# Realistic Backtester
class RealisticBacktester:
"""Backtesting framework with realistic assumptions."""
def __init__(self, initial_capital=100000, commission=0.001,
slippage=0.0005, max_position_size=0.1):
"""
Args:
initial_capital: Starting capital
commission: Commission per trade (0.1% = 0.001)
slippage: Expected slippage (0.05% = 0.0005)
max_position_size: Maximum position as fraction of portfolio
"""
self.initial_capital = initial_capital
self.commission = commission
self.slippage = slippage
self.max_position_size = max_position_size
def run_backtest(self, df, signal_col='signal', return_col='future_return'):
"""Run backtest with realistic assumptions."""
results = df.copy()
# Remove NaN signals
mask = ~results[signal_col].isna() & ~results[return_col].isna()
results = results[mask].copy()
# Initialize tracking
capital = self.initial_capital
position = 0 # 1 = long, -1 = short, 0 = flat
capitals = []
positions = []
trades = []
costs = []
for idx, row in results.iterrows():
signal = row[signal_col]
ret = row[return_col]
# Calculate position change
if signal != position:
# Trade occurred
trade_cost = abs(signal - position) * capital * (self.commission + self.slippage)
capital -= trade_cost
trades.append(1)
costs.append(trade_cost)
else:
trades.append(0)
costs.append(0)
# Update position
position = signal
# Apply position sizing
effective_position = position * self.max_position_size
# Calculate return
capital = capital * (1 + effective_position * ret)
capitals.append(capital)
positions.append(position)
results['capital'] = capitals
results['position'] = positions
results['trade'] = trades
results['cost'] = costs
return results
def calculate_metrics(self, results):
"""Calculate performance metrics."""
capitals = results['capital'].values
# Returns
returns = np.diff(capitals) / capitals[:-1]
# Total return
total_return = (capitals[-1] / capitals[0]) - 1
# Annualized return
n_years = len(capitals) / 252
annual_return = (1 + total_return) ** (1/n_years) - 1
# Sharpe ratio
sharpe = np.sqrt(252) * np.mean(returns) / (np.std(returns) + 1e-8)
# Max drawdown
peak = np.maximum.accumulate(capitals)
drawdown = (capitals - peak) / peak
max_drawdown = np.min(drawdown)
# Win rate
win_rate = np.mean(returns > 0)
# Trade statistics
n_trades = results['trade'].sum()
total_costs = results['cost'].sum()
return {
'total_return': total_return,
'annual_return': annual_return,
'sharpe_ratio': sharpe,
'max_drawdown': max_drawdown,
'win_rate': win_rate,
'n_trades': n_trades,
'total_costs': total_costs,
'cost_drag': total_costs / self.initial_capital
}
print("RealisticBacktester class defined")
# Run realistic backtest
backtester = RealisticBacktester(
initial_capital=100000,
commission=0.001, # 0.1%
slippage=0.0005, # 0.05%
max_position_size=1.0 # Full position
)
# Use walk-forward results
backtest_results = backtester.run_backtest(results, signal_col='signal', return_col='future_return')
# Calculate metrics
metrics = backtester.calculate_metrics(backtest_results)
print("\n" + "="*50)
print("Realistic Backtest Results")
print("="*50)
for key, value in metrics.items():
if 'return' in key or 'drawdown' in key or 'rate' in key or 'drag' in key:
print(f"{key}: {value:.2%}")
elif 'ratio' in key:
print(f"{key}: {value:.2f}")
else:
print(f"{key}: {value:,.0f}")
# Compare with and without costs
backtester_no_costs = RealisticBacktester(
initial_capital=100000,
commission=0,
slippage=0,
max_position_size=1.0
)
results_no_costs = backtester_no_costs.run_backtest(results)
metrics_no_costs = backtester_no_costs.calculate_metrics(results_no_costs)
print("\nImpact of Transaction Costs:")
print(f"Return without costs: {metrics_no_costs['total_return']:.2%}")
print(f"Return with costs: {metrics['total_return']:.2%}")
print(f"Cost impact: {metrics_no_costs['total_return'] - metrics['total_return']:.2%}")
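The cost comparison can also be turned into a rough break-even estimate: annual cost drag is approximately the number of position changes per year times the per-change cost, and the strategy's gross edge must exceed that drag. A minimal sketch (the trade count and the hypothetical gross edge are illustrative assumptions, not values from the backtest above):

```python
# Illustrative assumptions: ~120 position changes per year,
# 0.1% commission plus 0.05% slippage per position change
trades_per_year = 120
cost_per_change = 0.001 + 0.0005

# Annual drag on capital from trading alone
annual_cost_drag = trades_per_year * cost_per_change  # 0.18, i.e. 18% per year

# A gross annual edge below the drag is unprofitable net of costs
gross_annual_return = 0.22  # hypothetical gross edge
net_annual_return = gross_annual_return - annual_cost_drag
print(f"Cost drag: {annual_cost_drag:.1%}, net edge: {net_annual_return:.1%}")
```

This is why high-turnover ML strategies need a much larger per-trade edge than low-turnover ones.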
# Visualize backtest
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Equity curve
axes[0, 0].plot(backtest_results.index, backtest_results['capital'],
label='Strategy', linewidth=2)
# Buy and hold comparison
bh_capital = 100000 * (1 + backtest_results['future_return']).cumprod()
axes[0, 0].plot(backtest_results.index, bh_capital,
label='Buy & Hold', alpha=0.7, linewidth=2)
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Portfolio Value ($)')
axes[0, 0].set_title('Equity Curve')
axes[0, 0].legend()
# Drawdown
peak = backtest_results['capital'].cummax()
drawdown = (backtest_results['capital'] - peak) / peak
axes[0, 1].fill_between(drawdown.index, drawdown.values, 0, alpha=0.7)
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Drawdown')
axes[0, 1].set_title('Drawdown Over Time')
# Position over time
axes[1, 0].plot(backtest_results.index, backtest_results['position'],
linewidth=0.5)
axes[1, 0].set_xlabel('Date')
axes[1, 0].set_ylabel('Position')
axes[1, 0].set_title('Position Over Time')
# Cumulative costs
cum_costs = backtest_results['cost'].cumsum()
axes[1, 1].plot(backtest_results.index, cum_costs)
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Cumulative Costs ($)')
axes[1, 1].set_title('Transaction Costs')
plt.tight_layout()
plt.show()
Section 4: Robustness Testing
Testing strategy robustness helps ensure performance isn't due to luck or overfitting.
# Monte Carlo Simulation for Strategy Robustness
def monte_carlo_robustness(returns, n_simulations=1000):
"""Test strategy robustness using Monte Carlo."""
np.random.seed(42)
original_sharpe = np.sqrt(252) * returns.mean() / (returns.std() + 1e-8)
original_total = (1 + returns).prod() - 1
simulated_sharpes = []
simulated_returns = []
for _ in range(n_simulations):
# Sign-flip permutation null: merely shuffling the order of the
# returns would leave the mean, std, and product unchanged, making
# every simulation identical to the original
flipped = returns * np.random.choice([-1, 1], size=len(returns))
sim_sharpe = np.sqrt(252) * flipped.mean() / (flipped.std() + 1e-8)
sim_total = (1 + flipped).prod() - 1
simulated_sharpes.append(sim_sharpe)
simulated_returns.append(sim_total)
# Calculate percentiles
sharpe_percentile = (np.array(simulated_sharpes) < original_sharpe).mean() * 100
return_percentile = (np.array(simulated_returns) < original_total).mean() * 100
return {
'original_sharpe': original_sharpe,
'simulated_sharpes': simulated_sharpes,
'sharpe_percentile': sharpe_percentile,
'original_return': original_total,
'simulated_returns': simulated_returns,
'return_percentile': return_percentile
}
# Run Monte Carlo
strategy_returns = backtest_results['capital'].pct_change().dropna()
mc_results = monte_carlo_robustness(strategy_returns.values)
print("\nMonte Carlo Robustness Test:")
print(f"Strategy Sharpe: {mc_results['original_sharpe']:.2f}")
print(f"Sharpe percentile: {mc_results['sharpe_percentile']:.1f}%")
print(f"Strategy Return: {mc_results['original_return']:.2%}")
print(f"Return percentile: {mc_results['return_percentile']:.1f}%")
# Visualize Monte Carlo results
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Sharpe distribution
axes[0].hist(mc_results['simulated_sharpes'], bins=50, alpha=0.7, edgecolor='black')
axes[0].axvline(x=mc_results['original_sharpe'], color='red', linewidth=2,
label=f'Strategy: {mc_results["original_sharpe"]:.2f}')
axes[0].set_xlabel('Sharpe Ratio')
axes[0].set_ylabel('Frequency')
axes[0].set_title(f'Monte Carlo Sharpe Distribution\n(Percentile: {mc_results["sharpe_percentile"]:.1f}%)')
axes[0].legend()
# Return distribution
axes[1].hist(mc_results['simulated_returns'], bins=50, alpha=0.7, edgecolor='black')
axes[1].axvline(x=mc_results['original_return'], color='red', linewidth=2,
label=f'Strategy: {mc_results["original_return"]:.2%}')
axes[1].set_xlabel('Total Return')
axes[1].set_ylabel('Frequency')
axes[1].set_title(f'Monte Carlo Return Distribution\n(Percentile: {mc_results["return_percentile"]:.1f}%)')
axes[1].legend()
plt.tight_layout()
plt.show()
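The simulations above treat each daily return as independent, but real return series show autocorrelation and volatility clustering. A circular block bootstrap resamples contiguous blocks instead, preserving short-range dependence. A minimal sketch on synthetic data (the 10-day block length and the synthetic return parameters are assumptions):

```python
import numpy as np

def circular_block_bootstrap(returns, block_len=10, n_simulations=1000, seed=42):
    """Resample contiguous blocks of returns (wrapping at the series end)
    so that short-range dependence within blocks is preserved."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns)
    n = len(returns)
    n_blocks = int(np.ceil(n / block_len))
    sims = np.empty((n_simulations, n))
    for i in range(n_simulations):
        starts = rng.integers(0, n, size=n_blocks)
        # Wrap block indices around the end of the series (circular)
        idx = (starts[:, None] + np.arange(block_len)) % n
        sims[i] = returns[idx.ravel()[:n]]
    return sims

# Synthetic daily returns standing in for the strategy's return series
rng = np.random.default_rng(0)
r = rng.normal(0.0005, 0.01, 500)
sims = circular_block_bootstrap(r, block_len=10, n_simulations=200)
sim_sharpes = np.sqrt(252) * sims.mean(axis=1) / sims.std(axis=1)
print(f"Simulated Sharpe range: {sim_sharpes.min():.2f} to {sim_sharpes.max():.2f}")
```

The resulting Sharpe distribution is usually wider than one built from independent draws, giving a more conservative read on significance.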
# Parameter Sensitivity Analysis
def sensitivity_analysis(df, feature_cols, param_name, param_values):
"""Analyze sensitivity to a parameter."""
results = []
X = df[feature_cols].values
y = df['target'].values
split_idx = int(len(X) * 0.7)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
for value in param_values:
# Merge the swept value into the defaults; building one dict avoids
# passing n_estimators twice when param_name == 'n_estimators'
params = {'n_estimators': 100, 'random_state': 42, param_name: value}
model = RandomForestClassifier(**params)
model.fit(X_train_scaled, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test_scaled))
results.append({'param_value': value, 'test_accuracy': test_acc})
return pd.DataFrame(results)
# Test sensitivity to max_depth
depth_sensitivity = sensitivity_analysis(
df_features, feature_cols,
'max_depth', range(2, 15)
)
# Test sensitivity to n_estimators
n_est_sensitivity = sensitivity_analysis(
df_features, feature_cols,
'n_estimators', [10, 25, 50, 100, 200, 300]
)
print("Parameter Sensitivity:")
print("\nMax Depth:")
print(depth_sensitivity.to_string(index=False))
# Visualize sensitivity
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(depth_sensitivity['param_value'], depth_sensitivity['test_accuracy'],
marker='o', linewidth=2)
axes[0].set_xlabel('Max Depth')
axes[0].set_ylabel('Test Accuracy')
axes[0].set_title('Sensitivity to Max Depth')
axes[1].plot(n_est_sensitivity['param_value'], n_est_sensitivity['test_accuracy'],
marker='o', linewidth=2)
axes[1].set_xlabel('Number of Estimators')
axes[1].set_ylabel('Test Accuracy')
axes[1].set_title('Sensitivity to N Estimators')
plt.tight_layout()
plt.show()
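Varying one parameter at a time can hide interactions between parameters. A joint grid over two parameters gives a fuller picture; here is a standalone sketch on synthetic data (the feature/target generation, split point, and grid values are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic classification data standing in for engineered features
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 400) > 0).astype(int)

split = 280  # chronological split: first 70% train, rest test
rows = []
for depth in [2, 4, 6]:
    for n_est in [25, 50, 100]:
        model = RandomForestClassifier(n_estimators=n_est, max_depth=depth,
                                       random_state=42)
        model.fit(X[:split], y[:split])
        acc = accuracy_score(y[split:], model.predict(X[split:]))
        rows.append({'max_depth': depth, 'n_estimators': n_est, 'accuracy': acc})

# Pivot into a depth-by-estimators grid for inspection
grid = pd.DataFrame(rows).pivot(index='max_depth', columns='n_estimators',
                                values='accuracy')
print(grid.round(3))
```

If accuracy swings sharply across neighboring cells, the strategy is likely fit to noise rather than signal.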
Section 5: Module Project - Complete Backtesting System
Build a complete backtesting system with all proper safeguards.
# Complete ML Backtesting System
class MLBacktestingSystem:
"""Complete ML strategy backtesting system."""
def __init__(self, model, initial_capital=100000,
commission=0.001, slippage=0.0005):
self.model = model
self.initial_capital = initial_capital
self.commission = commission
self.slippage = slippage
self.scaler = StandardScaler()
self.walk_forward_results = None
self.backtest_results = None
def run_walk_forward(self, df, feature_cols, target_col='target',
train_window=252, test_window=21, step_size=21):
"""Run walk-forward optimization."""
X = df[feature_cols].values
y = df[target_col].values
dates = df.index
n_samples = len(X)
predictions = np.full(n_samples, np.nan)
probabilities = np.full(n_samples, np.nan)
start_idx = train_window
fold_results = []
while start_idx + test_window <= n_samples:
train_start = max(0, start_idx - train_window)
train_end = start_idx
test_start = start_idx
test_end = min(start_idx + test_window, n_samples)
X_train = X[train_start:train_end]
y_train = y[train_start:train_end]
X_test = X[test_start:test_end]
y_test = y[test_start:test_end]
X_train_scaled = self.scaler.fit_transform(X_train)
X_test_scaled = self.scaler.transform(X_test)
self.model.fit(X_train_scaled, y_train)
pred = self.model.predict(X_test_scaled)
prob = self.model.predict_proba(X_test_scaled)[:, 1]
predictions[test_start:test_end] = pred
probabilities[test_start:test_end] = prob
fold_results.append({
'test_start': dates[test_start],
'test_end': dates[test_end-1],
'accuracy': accuracy_score(y_test, pred)
})
start_idx += step_size
results = df.copy()
results['prediction'] = predictions
results['probability'] = probabilities
# Rows without a prediction must stay NaN so run_backtest can drop them
results['signal'] = np.where(np.isnan(predictions), np.nan, np.where(predictions == 1, 1, -1))
self.walk_forward_results = results
self.fold_summary = pd.DataFrame(fold_results)
return results
def run_backtest(self, signal_col='signal', return_col='future_return'):
"""Run realistic backtest."""
if self.walk_forward_results is None:
raise ValueError("Run walk_forward first")
df = self.walk_forward_results.copy()
mask = ~df[signal_col].isna() & ~df[return_col].isna()
df = df[mask].copy()
capital = self.initial_capital
position = 0
capitals = []
trades = []
for _, row in df.iterrows():
signal = row[signal_col]
ret = row[return_col]
if signal != position:
cost = abs(signal - position) * capital * (self.commission + self.slippage)
capital -= cost
trades.append(1)
else:
trades.append(0)
position = signal
capital = capital * (1 + position * ret)
capitals.append(capital)
df['capital'] = capitals
df['trade'] = trades
self.backtest_results = df
return df
def calculate_metrics(self):
"""Calculate all performance metrics."""
if self.backtest_results is None:
raise ValueError("Run backtest first")
capitals = self.backtest_results['capital'].values
returns = np.diff(capitals) / capitals[:-1]
total_return = (capitals[-1] / capitals[0]) - 1
n_years = len(capitals) / 252
annual_return = (1 + total_return) ** (1/n_years) - 1
sharpe = np.sqrt(252) * np.mean(returns) / (np.std(returns) + 1e-8)
peak = np.maximum.accumulate(capitals)
drawdown = (capitals - peak) / peak
max_drawdown = np.min(drawdown)
# Walk-forward metrics
avg_fold_accuracy = self.fold_summary['accuracy'].mean()
accuracy_std = self.fold_summary['accuracy'].std()
return {
'total_return': total_return,
'annual_return': annual_return,
'sharpe_ratio': sharpe,
'max_drawdown': max_drawdown,
'avg_fold_accuracy': avg_fold_accuracy,
'accuracy_std': accuracy_std,
'n_trades': self.backtest_results['trade'].sum(),
'n_folds': len(self.fold_summary)
}
def run_robustness_tests(self, n_simulations=500):
"""Run Monte Carlo robustness tests."""
returns = self.backtest_results['capital'].pct_change().dropna().values
original_sharpe = np.sqrt(252) * returns.mean() / (returns.std() + 1e-8)
simulated_sharpes = []
for _ in range(n_simulations):
# Sign-flip null: a plain shuffle would leave mean and std unchanged
flipped = returns * np.random.choice([-1, 1], size=len(returns))
sim_sharpe = np.sqrt(252) * flipped.mean() / (flipped.std() + 1e-8)
simulated_sharpes.append(sim_sharpe)
percentile = (np.array(simulated_sharpes) < original_sharpe).mean() * 100
return {
'original_sharpe': original_sharpe,
'sharpe_percentile': percentile,
'is_significant': percentile > 95
}
def generate_report(self):
"""Generate complete backtest report."""
metrics = self.calculate_metrics()
robustness = self.run_robustness_tests()
print("\n" + "="*60)
print("ML STRATEGY BACKTEST REPORT")
print("="*60)
print("\n--- Performance Metrics ---")
print(f"Total Return: {metrics['total_return']:.2%}")
print(f"Annual Return: {metrics['annual_return']:.2%}")
print(f"Sharpe Ratio: {metrics['sharpe_ratio']:.2f}")
print(f"Max Drawdown: {metrics['max_drawdown']:.2%}")
print("\n--- Walk-Forward Results ---")
print(f"Number of Folds: {metrics['n_folds']}")
print(f"Average Fold Accuracy: {metrics['avg_fold_accuracy']:.4f}")
print(f"Accuracy Std Dev: {metrics['accuracy_std']:.4f}")
print("\n--- Trading Statistics ---")
print(f"Number of Trades: {metrics['n_trades']}")
print("\n--- Robustness Tests ---")
print(f"Strategy Sharpe: {robustness['original_sharpe']:.2f}")
print(f"Sharpe Percentile: {robustness['sharpe_percentile']:.1f}%")
print(f"Statistically Significant: {robustness['is_significant']}")
return metrics, robustness
print("MLBacktestingSystem class defined")
# Run complete backtesting system
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
system = MLBacktestingSystem(
model=model,
initial_capital=100000,
commission=0.001,
slippage=0.0005
)
# Run walk-forward
wf_results = system.run_walk_forward(
df_features, feature_cols,
train_window=252,
test_window=21,
step_size=21
)
# Run backtest
bt_results = system.run_backtest()
# Generate report
metrics, robustness = system.generate_report()
# Comprehensive visualization
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
# Equity curve
axes[0, 0].plot(bt_results.index, bt_results['capital'], linewidth=2)
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Portfolio Value ($)')
axes[0, 0].set_title('Equity Curve')
# Drawdown
peak = bt_results['capital'].cummax()
dd = (bt_results['capital'] - peak) / peak
axes[0, 1].fill_between(dd.index, dd.values, 0, alpha=0.7, color='red')
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Drawdown')
axes[0, 1].set_title('Drawdown')
# Walk-forward accuracy
axes[0, 2].bar(range(len(system.fold_summary)), system.fold_summary['accuracy'])
axes[0, 2].axhline(y=0.5, color='red', linestyle='--')
axes[0, 2].axhline(y=system.fold_summary['accuracy'].mean(), color='green', linestyle='--')
axes[0, 2].set_xlabel('Fold')
axes[0, 2].set_ylabel('Accuracy')
axes[0, 2].set_title('Walk-Forward Accuracy')
# Monthly returns
monthly_returns = bt_results['capital'].resample('M').last().pct_change().dropna()
colors = ['green' if r > 0 else 'red' for r in monthly_returns]
axes[1, 0].bar(range(len(monthly_returns)), monthly_returns.values, color=colors)
axes[1, 0].set_xlabel('Month')
axes[1, 0].set_ylabel('Return')
axes[1, 0].set_title('Monthly Returns')
# Return distribution
daily_returns = bt_results['capital'].pct_change().dropna()
axes[1, 1].hist(daily_returns, bins=50, edgecolor='black', alpha=0.7)
axes[1, 1].axvline(x=0, color='red', linestyle='--')
axes[1, 1].set_xlabel('Daily Return')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Return Distribution')
# Rolling Sharpe
rolling_sharpe = np.sqrt(252) * daily_returns.rolling(63).mean() / daily_returns.rolling(63).std()
axes[1, 2].plot(rolling_sharpe.index, rolling_sharpe.values)
axes[1, 2].axhline(y=0, color='red', linestyle='--')
axes[1, 2].set_xlabel('Date')
axes[1, 2].set_ylabel('Rolling Sharpe (63-day)')
axes[1, 2].set_title('Rolling Sharpe Ratio')
plt.tight_layout()
plt.show()
Exercises
Complete the following exercises to practice ML backtesting.
Exercise 12.1: Implement Walk-Forward Split (Guided)
Create a function that generates walk-forward train/test indices.
Solution 12.1
def walk_forward_split(n_samples, train_size, test_size, step_size):
splits = []
# Start from end of first training window
start_idx = train_size
while start_idx + test_size <= n_samples:
# Calculate train indices
train_start = max(0, start_idx - train_size)
train_end = start_idx
train_indices = list(range(train_start, train_end))
# Calculate test indices
test_start = start_idx
test_end = min(start_idx + test_size, n_samples)
test_indices = list(range(test_start, test_end))
splits.append((train_indices, test_indices))
# Move to next fold
start_idx += step_size
return splits
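A useful sanity check on any walk-forward splitter is that every test index comes strictly after every train index in its fold, and that with `step_size == test_size` the test windows tile without overlap. The splitter is restated inline below so the check runs standalone (the window sizes are arbitrary toy values):

```python
def walk_forward_split(n_samples, train_size, test_size, step_size):
    splits = []
    start_idx = train_size
    while start_idx + test_size <= n_samples:
        train_indices = list(range(max(0, start_idx - train_size), start_idx))
        test_indices = list(range(start_idx, min(start_idx + test_size, n_samples)))
        splits.append((train_indices, test_indices))
        start_idx += step_size
    return splits

splits = walk_forward_split(n_samples=100, train_size=30, test_size=10, step_size=10)

for train_idx, test_idx in splits:
    # No look-ahead: all training data precedes all test data
    assert max(train_idx) < min(test_idx)

# With step_size == test_size, test windows never overlap
all_test = [i for _, t in splits for i in t]
assert len(all_test) == len(set(all_test))
print(f"{len(splits)} folds, {len(all_test)} test observations, no overlap")
```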
Exercise 12.2: Calculate Strategy Metrics (Guided)
Implement a function to calculate key performance metrics.
Solution 12.2
def calculate_strategy_metrics(returns):
returns = np.array(returns)
# Calculate total return
total_return = (1 + returns).prod() - 1
# Calculate Sharpe ratio (annualized)
sharpe = np.sqrt(252) * returns.mean() / (returns.std() + 1e-8)
# Calculate max drawdown
cumulative = (1 + returns).cumprod()
peak = np.maximum.accumulate(cumulative)
drawdown = (cumulative - peak) / peak
max_dd = drawdown.min()
# Calculate win rate
win_rate = (returns > 0).mean()
return {
'total_return': total_return,
'sharpe_ratio': sharpe,
'max_drawdown': max_dd,
'win_rate': win_rate
}
Exercise 12.3: Implement Transaction Cost Calculator (Guided)
Create a function that calculates transaction costs for a signal series.
Solution 12.3
def calculate_transaction_costs(signals, prices, commission=0.001, slippage=0.0005):
signals = np.array(signals)
prices = np.array(prices)
# Calculate position changes
position_changes = np.abs(np.diff(signals))
position_changes = np.insert(position_changes, 0, abs(signals[0]))
# Calculate trade values
trade_values = position_changes * prices
# Calculate costs
costs = trade_values * (commission + slippage)
return {
'total_cost': costs.sum(),
'n_trades': (position_changes > 0).sum(),
'costs_per_trade': costs
}
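A quick hand-check of the cost logic helps confirm the conventions: a flat-to-long entry trades one unit of notional, while a long-to-short flip trades two. The core arithmetic is restated inline so the check runs standalone (the signal and price values are illustrative):

```python
import numpy as np

signals = np.array([0, 1, 1, -1])     # flat -> long -> hold -> flip to short
prices = np.array([100.0, 100.0, 100.0, 100.0])
commission, slippage = 0.001, 0.0005

position_changes = np.abs(np.diff(signals))
position_changes = np.insert(position_changes, 0, abs(signals[0]))
costs = position_changes * prices * (commission + slippage)

# Entry trades 1 unit, the flip trades 2: (1 + 2) * 100 * 0.0015 = 0.45
print(f"Total cost: {costs.sum():.2f}, trades: {(position_changes > 0).sum()}")
```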
Exercise 12.4: Build Expanding Window Backtester (Open-ended)
Create a backtester that uses expanding training windows instead of fixed rolling windows.
Solution 12.4
class ExpandingWindowBacktester:
def __init__(self, model, min_train_samples=252, test_window=21):
self.model = model
self.min_train_samples = min_train_samples
self.test_window = test_window
self.scaler = StandardScaler()
self.results = []
def run(self, X, y):
n_samples = len(X)
predictions = np.full(n_samples, np.nan)
start_idx = self.min_train_samples
while start_idx + self.test_window <= n_samples:
# Expanding window: use ALL data from beginning
train_start = 0 # Always start from beginning
train_end = start_idx
test_start = start_idx
test_end = start_idx + self.test_window
X_train = X[train_start:train_end]
y_train = y[train_start:train_end]
X_test = X[test_start:test_end]
y_test = y[test_start:test_end]
X_train_scaled = self.scaler.fit_transform(X_train)
X_test_scaled = self.scaler.transform(X_test)
self.model.fit(X_train_scaled, y_train)
pred = self.model.predict(X_test_scaled)
predictions[test_start:test_end] = pred
self.results.append({
'train_size': len(X_train),
'accuracy': accuracy_score(y_test, pred)
})
start_idx += self.test_window
return predictions, pd.DataFrame(self.results)
# Compare with rolling
expanding = ExpandingWindowBacktester(
RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
min_train_samples=252
)
exp_pred, exp_results = expanding.run(X, y)
print(f"Expanding window avg accuracy: {exp_results['accuracy'].mean():.4f}")
Exercise 12.5: Implement Bootstrap Confidence Intervals (Open-ended)
Create a function that calculates bootstrap confidence intervals for strategy metrics.
Solution 12.5
def bootstrap_confidence_intervals(returns, n_bootstrap=1000, confidence_level=0.95):
"""Calculate bootstrap confidence intervals for strategy metrics."""
np.random.seed(42)
returns = np.array(returns)
n_samples = len(returns)
bootstrap_sharpes = []
bootstrap_returns = []
for _ in range(n_bootstrap):
# Resample with replacement
sample_indices = np.random.choice(n_samples, size=n_samples, replace=True)
sample_returns = returns[sample_indices]
# Calculate metrics
sharpe = np.sqrt(252) * sample_returns.mean() / (sample_returns.std() + 1e-8)
total_ret = (1 + sample_returns).prod() - 1
bootstrap_sharpes.append(sharpe)
bootstrap_returns.append(total_ret)
# Calculate confidence intervals
alpha = (1 - confidence_level) / 2
sharpe_ci = (
np.percentile(bootstrap_sharpes, alpha * 100),
np.percentile(bootstrap_sharpes, (1 - alpha) * 100)
)
return_ci = (
np.percentile(bootstrap_returns, alpha * 100),
np.percentile(bootstrap_returns, (1 - alpha) * 100)
)
# Is significantly positive?
sharpe_significant = sharpe_ci[0] > 0
return {
'sharpe_ci': sharpe_ci,
'return_ci': return_ci,
'sharpe_significant': sharpe_significant,
'original_sharpe': np.sqrt(252) * returns.mean() / (returns.std() + 1e-8)
}
# Test
strategy_returns = backtest_results['capital'].pct_change().dropna().values
ci_results = bootstrap_confidence_intervals(strategy_returns)
print(f"Sharpe 95% CI: ({ci_results['sharpe_ci'][0]:.2f}, {ci_results['sharpe_ci'][1]:.2f})")
print(f"Statistically significant: {ci_results['sharpe_significant']}")
Exercise 12.6: Build Regime-Aware Backtester (Open-ended)
Create a backtester that tracks performance across different market regimes.
Solution 12.6
class RegimeAwareBacktester:
def __init__(self, lookback=63):
self.lookback = lookback
def identify_regimes(self, prices):
"""Identify market regimes based on price trend."""
returns = pd.Series(prices).pct_change()
rolling_return = returns.rolling(self.lookback).mean() * 252
rolling_vol = returns.rolling(self.lookback).std() * np.sqrt(252)
regimes = pd.Series(index=range(len(prices)), dtype=str)
for i in range(len(prices)):
if pd.isna(rolling_return.iloc[i]):
regimes.iloc[i] = 'unknown'
elif rolling_return.iloc[i] > 0.1: # >10% annualized
regimes.iloc[i] = 'bull'
elif rolling_return.iloc[i] < -0.1: # <-10% annualized
regimes.iloc[i] = 'bear'
else:
regimes.iloc[i] = 'sideways'
return regimes
def analyze_by_regime(self, strategy_returns, prices):
"""Analyze strategy performance by regime."""
regimes = self.identify_regimes(prices)
# Align lengths: pct_change().dropna() drops the first observation,
# so trim leading regime labels to match the returns array
regimes = regimes.iloc[len(regimes) - len(strategy_returns):]
results = {}
for regime in ['bull', 'bear', 'sideways']:
mask = (regimes == regime).values
regime_returns = strategy_returns[mask]
if len(regime_returns) > 0:
results[regime] = {
'n_days': len(regime_returns),
'total_return': (1 + regime_returns).prod() - 1,
'sharpe': np.sqrt(252) * regime_returns.mean() / (regime_returns.std() + 1e-8),
'win_rate': (regime_returns > 0).mean()
}
return pd.DataFrame(results).T
# Usage
regime_analyzer = RegimeAwareBacktester(lookback=63)
strategy_rets = backtest_results['capital'].pct_change().dropna().values
prices = backtest_results['close'].values
regime_results = regime_analyzer.analyze_by_regime(strategy_rets, prices)
print("Performance by Regime:")
print(regime_results)
Summary
In this module, you learned:
- Walk-Forward Optimization: Proper methodology for testing ML strategies on unseen data
- Avoiding Pitfalls: Look-ahead bias, survivorship bias, and overfitting detection
- Realistic Backtesting: Accounting for transaction costs, slippage, and execution
- Robustness Testing: Monte Carlo simulations and parameter sensitivity analysis
- Complete Systems: Building production-ready backtesting frameworks
Key Takeaways
- Walk-forward validation is essential for ML strategies to avoid look-ahead bias
- Transaction costs can significantly impact strategy performance
- Monte Carlo tests help distinguish skill from luck
- Overfitting is the #1 enemy of ML trading strategies
- Always test robustness before deploying any strategy
Next Steps
In Module 13, you'll learn about deploying ML models to production, including feature pipelines, model monitoring, and system architecture.
Module 13: Production ML Systems
Overview
Moving ML models from research to production requires careful engineering. This module covers the infrastructure, pipelines, and monitoring needed to deploy ML trading systems reliably.
Learning Objectives
By the end of this module, you will be able to:
- Design feature pipelines for real-time prediction
- Implement model versioning and deployment strategies
- Build monitoring systems to detect model degradation
- Create robust error handling and fallback mechanisms
Prerequisites
- Module 11: Deep Learning for Finance
- Module 12: Backtesting ML Strategies
- Basic understanding of software engineering principles
Estimated Time: 3.5 hours
Section 1: Feature Pipeline Architecture
A robust feature pipeline ensures consistent feature computation between training and inference.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from typing import Dict, List, Any, Optional
from dataclasses import dataclass, field
from abc import ABC, abstractmethod
import json
import hashlib
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
print("Production ML libraries loaded")
# Feature Definition Framework
@dataclass
class FeatureDefinition:
"""Defines a single feature for the pipeline."""
name: str
feature_type: str # 'price', 'volume', 'technical', 'derived'
lookback_periods: int
dependencies: List[str] = field(default_factory=list)
params: Dict[str, Any] = field(default_factory=dict)
def to_dict(self) -> Dict:
return {
'name': self.name,
'feature_type': self.feature_type,
'lookback_periods': self.lookback_periods,
'dependencies': self.dependencies,
'params': self.params
}
class FeatureRegistry:
"""Central registry for all feature definitions."""
def __init__(self):
self.features: Dict[str, FeatureDefinition] = {}
self.computation_order: List[str] = []
def register(self, feature: FeatureDefinition):
"""Register a feature definition."""
self.features[feature.name] = feature
self._update_computation_order()
def _update_computation_order(self):
"""Topologically sort features based on dependencies."""
visited = set()
order = []
def visit(name):
if name in visited:
return
visited.add(name)
if name in self.features:
for dep in self.features[name].dependencies:
visit(dep)
order.append(name)
for name in self.features:
visit(name)
self.computation_order = order
def get_max_lookback(self) -> int:
"""Get maximum lookback period needed."""
return max(f.lookback_periods for f in self.features.values())
def get_feature_hash(self) -> str:
"""Generate hash of feature definitions for versioning."""
feature_str = json.dumps(
{name: f.to_dict() for name, f in sorted(self.features.items())},
sort_keys=True
)
return hashlib.md5(feature_str.encode()).hexdigest()[:8]
print("Feature definition framework created")
# Feature Computation Engine
class FeatureComputer(ABC):
"""Abstract base class for feature computation."""
@abstractmethod
def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
pass
class ReturnFeature(FeatureComputer):
"""Compute return features."""
def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
period = params.get('period', 1)
return df['close'].pct_change(period)
class VolatilityFeature(FeatureComputer):
"""Compute volatility features."""
def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
period = params.get('period', 20)
returns = df['close'].pct_change()
return returns.rolling(period).std()
class SMAFeature(FeatureComputer):
"""Compute simple moving average features."""
def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
period = params.get('period', 20)
return df['close'].rolling(period).mean()
class RSIFeature(FeatureComputer):
"""Compute RSI feature."""
def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
period = params.get('period', 14)
delta = df['close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(period).mean()
loss = (-delta.where(delta < 0, 0)).rolling(period).mean()
rs = gain / (loss + 1e-10)
return 100 - (100 / (1 + rs))
class MACDFeature(FeatureComputer):
"""Compute MACD feature."""
def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
fast = params.get('fast', 12)
slow = params.get('slow', 26)
exp_fast = df['close'].ewm(span=fast).mean()
exp_slow = df['close'].ewm(span=slow).mean()
return exp_fast - exp_slow
class PriceToSMAFeature(FeatureComputer):
"""Compute price relative to SMA."""
def compute(self, df: pd.DataFrame, params: Dict) -> pd.Series:
period = params.get('period', 20)
sma = df['close'].rolling(period).mean()
return df['close'] / sma
# Feature Computer Registry
FEATURE_COMPUTERS = {
'return': ReturnFeature(),
'volatility': VolatilityFeature(),
'sma': SMAFeature(),
'rsi': RSIFeature(),
'macd': MACDFeature(),
'price_to_sma': PriceToSMAFeature()
}
print("Feature computers registered")
# Production Feature Pipeline
class FeaturePipeline:
"""Production-ready feature pipeline."""
def __init__(self, registry: FeatureRegistry):
self.registry = registry
self.scaler = StandardScaler()
self.is_fitted = False
self.feature_stats = {}
def compute_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Compute all registered features."""
result = df.copy()
for feature_name in self.registry.computation_order:
feature_def = self.registry.features[feature_name]
computer = FEATURE_COMPUTERS.get(feature_def.feature_type)
if computer:
result[feature_name] = computer.compute(result, feature_def.params)
return result
def fit(self, df: pd.DataFrame):
"""Fit the pipeline on training data."""
features_df = self.compute_features(df)
feature_cols = list(self.registry.features.keys())
# Store feature statistics
for col in feature_cols:
self.feature_stats[col] = {
'mean': features_df[col].mean(),
'std': features_df[col].std(),
'min': features_df[col].min(),
'max': features_df[col].max()
}
# Fit scaler
valid_data = features_df[feature_cols].dropna()
self.scaler.fit(valid_data)
self.is_fitted = True
def transform(self, df: pd.DataFrame) -> pd.DataFrame:
"""Transform data using fitted pipeline."""
if not self.is_fitted:
raise ValueError("Pipeline not fitted. Call fit() first.")
features_df = self.compute_features(df)
feature_cols = list(self.registry.features.keys())
# Scale features
valid_mask = ~features_df[feature_cols].isna().any(axis=1)
result = features_df.copy()
result.loc[valid_mask, feature_cols] = self.scaler.transform(
features_df.loc[valid_mask, feature_cols]
)
return result
def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
"""Fit and transform in one step."""
self.fit(df)
return self.transform(df)
def get_feature_vector(self, df: pd.DataFrame) -> np.ndarray:
"""Get feature vector for prediction."""
transformed = self.transform(df)
feature_cols = list(self.registry.features.keys())
return transformed[feature_cols].iloc[-1].values
def validate_features(self, df: pd.DataFrame) -> Dict[str, Any]:
"""Validate features against training statistics."""
features_df = self.compute_features(df)
validation_results = {}
for col, stats in self.feature_stats.items():
current_value = features_df[col].iloc[-1]
# Check for outliers
z_score = (current_value - stats['mean']) / (stats['std'] + 1e-10)
is_outlier = abs(z_score) > 3
validation_results[col] = {
'value': current_value,
'z_score': z_score,
'is_outlier': is_outlier
}
return validation_results
def save(self, filepath: str):
"""Save pipeline to file."""
state = {
'registry': self.registry,
'scaler': self.scaler,
'feature_stats': self.feature_stats,
'is_fitted': self.is_fitted,
'feature_hash': self.registry.get_feature_hash()
}
with open(filepath, 'wb') as f:
pickle.dump(state, f)
@classmethod
def load(cls, filepath: str) -> 'FeaturePipeline':
"""Load pipeline from file."""
with open(filepath, 'rb') as f:
state = pickle.load(f)
pipeline = cls(state['registry'])
pipeline.scaler = state['scaler']
pipeline.feature_stats = state['feature_stats']
pipeline.is_fitted = state['is_fitted']
return pipeline
print("FeaturePipeline class defined")
# Create and test feature pipeline
# Generate sample data
def generate_sample_data(n_samples=1000):
np.random.seed(42)
dates = pd.date_range(start='2020-01-01', periods=n_samples, freq='D')
returns = np.random.normal(0.0003, 0.015, n_samples)
prices = 100 * np.exp(np.cumsum(returns))
return pd.DataFrame({
'date': dates,
'open': np.roll(prices, 1),
'high': prices * (1 + np.abs(np.random.normal(0, 0.01, n_samples))),
'low': prices * (1 - np.abs(np.random.normal(0, 0.01, n_samples))),
'close': prices,
'volume': np.random.lognormal(15, 0.5, n_samples)
}).set_index('date')
df = generate_sample_data()
# Create feature registry
registry = FeatureRegistry()
# Register features
registry.register(FeatureDefinition('return_1d', 'return', 1, params={'period': 1}))
registry.register(FeatureDefinition('return_5d', 'return', 5, params={'period': 5}))
registry.register(FeatureDefinition('volatility_20d', 'volatility', 20, params={'period': 20}))
registry.register(FeatureDefinition('rsi', 'rsi', 14, params={'period': 14}))
registry.register(FeatureDefinition('macd', 'macd', 26, params={'fast': 12, 'slow': 26}))
registry.register(FeatureDefinition('price_to_sma_20', 'price_to_sma', 20, params={'period': 20}))
# Create pipeline
pipeline = FeaturePipeline(registry)
# Fit on training data
train_df = df.iloc[:800]
test_df = df.iloc[800:]
pipeline.fit(train_df)
print(f"Feature hash: {registry.get_feature_hash()}")
print(f"Max lookback: {registry.get_max_lookback()} days")
print(f"Computation order: {registry.computation_order}")
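The `validate_features` method above reduces to a z-score check of each live feature value against statistics captured at fit time. The same idea can be sketched standalone with plain pandas (the synthetic training distributions, the drifted live reading, and the 3-sigma threshold are all illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic training-time feature distributions
rng = np.random.default_rng(42)
train_features = pd.DataFrame({'rsi': rng.normal(50, 10, 500),
                               'volatility': rng.normal(0.015, 0.003, 500)})

# Hypothetical live observation with a drifted volatility reading
live = {'rsi': 55.0, 'volatility': 0.040}

for col, value in live.items():
    mean, std = train_features[col].mean(), train_features[col].std()
    z = (value - mean) / (std + 1e-10)  # distance from training mean
    flag = "OUTLIER" if abs(z) > 3 else "ok"
    print(f"{col}: value={value:.4f}, z={z:+.2f} -> {flag}")
```

In production, an outlier flag like this would typically suppress the trade or fall back to a neutral position rather than feed a suspect feature vector to the model.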
Section 2: Model Versioning and Deployment
Proper model versioning ensures reproducibility and enables rollback if needed.
# Model Versioning System
@dataclass
class ModelVersion:
"""Represents a versioned model."""
version_id: str
model_type: str
feature_hash: str
created_at: datetime
metrics: Dict[str, float]
hyperparameters: Dict[str, Any]
is_active: bool = False
def to_dict(self) -> Dict:
return {
'version_id': self.version_id,
'model_type': self.model_type,
'feature_hash': self.feature_hash,
'created_at': self.created_at.isoformat(),
'metrics': self.metrics,
'hyperparameters': self.hyperparameters,
'is_active': self.is_active
}
class ModelRegistry:
"""Registry for model versions."""
def __init__(self):
self.versions: Dict[str, ModelVersion] = {}
self.models: Dict[str, Any] = {}
self.active_version: Optional[str] = None
def register_model(self, model, version: ModelVersion):
"""Register a new model version."""
self.versions[version.version_id] = version
self.models[version.version_id] = model
print(f"Registered model version: {version.version_id}")
def activate_version(self, version_id: str):
"""Activate a specific model version."""
if version_id not in self.versions:
raise ValueError(f"Version {version_id} not found")
# Deactivate current
if self.active_version:
self.versions[self.active_version].is_active = False
# Activate new
self.versions[version_id].is_active = True
self.active_version = version_id
print(f"Activated version: {version_id}")
def get_active_model(self):
"""Get the currently active model."""
if not self.active_version:
raise ValueError("No active model version")
return self.models[self.active_version]
def get_version_history(self) -> pd.DataFrame:
"""Get version history as dataframe."""
records = [v.to_dict() for v in self.versions.values()]
return pd.DataFrame(records)
def rollback(self, version_id: str):
"""Rollback to a previous version."""
if version_id not in self.versions:
raise ValueError(f"Version {version_id} not found")
self.activate_version(version_id)
print(f"Rolled back to version: {version_id}")
def compare_versions(self, version_ids: List[str]) -> pd.DataFrame:
"""Compare metrics across versions."""
comparisons = []
for vid in version_ids:
if vid in self.versions:
v = self.versions[vid]
record = {'version_id': vid, **v.metrics}
comparisons.append(record)
return pd.DataFrame(comparisons)
print("Model versioning system defined")
# Model Deployment Manager
class ModelDeploymentManager:
"""Manages model deployment lifecycle."""
def __init__(self, model_registry: ModelRegistry,
feature_pipeline: FeaturePipeline):
self.model_registry = model_registry
self.feature_pipeline = feature_pipeline
self.deployment_history = []
def train_new_version(self, train_data: pd.DataFrame,
model_class, hyperparameters: Dict,
target_col: str = 'target') -> str:
"""Train and register a new model version."""
# Prepare features
features_df = self.feature_pipeline.fit_transform(train_data)
feature_cols = list(self.feature_pipeline.registry.features.keys())
# Prepare target: 1 if next-day close is higher
features_df['target'] = (features_df['close'].shift(-1) >
features_df['close']).astype(int)
# Drop the last row (it has no next-day close, so the comparison yields a spurious 0 label), then remove NaN features
valid_data = features_df.iloc[:-1].dropna()
X = valid_data[feature_cols].values
y = valid_data[target_col].values
# Train model
model = model_class(**hyperparameters)
model.fit(X, y)
# Calculate metrics
train_pred = model.predict(X)
accuracy = (train_pred == y).mean()
# Create version
version_id = f"v_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
version = ModelVersion(
version_id=version_id,
model_type=model_class.__name__,
feature_hash=self.feature_pipeline.registry.get_feature_hash(),
created_at=datetime.now(),
metrics={'train_accuracy': accuracy},
hyperparameters=hyperparameters
)
# Register
self.model_registry.register_model(model, version)
return version_id
def validate_before_deploy(self, version_id: str,
validation_data: pd.DataFrame,
min_accuracy: float = 0.5) -> bool:
"""Validate model before deployment."""
model = self.model_registry.models[version_id]
# Prepare validation data
features_df = self.feature_pipeline.transform(validation_data)
feature_cols = list(self.feature_pipeline.registry.features.keys())
features_df['target'] = (features_df['close'].shift(-1) >
features_df['close']).astype(int)
# Drop the last row (no next-day close means a spurious 0 label), then remove NaN features
valid_data = features_df.iloc[:-1].dropna()
X = valid_data[feature_cols].values
y = valid_data['target'].values
# Validate
predictions = model.predict(X)
accuracy = (predictions == y).mean()
# Update version metrics
self.model_registry.versions[version_id].metrics['val_accuracy'] = accuracy
is_valid = accuracy >= min_accuracy
print(f"Validation accuracy: {accuracy:.4f} - {'PASSED' if is_valid else 'FAILED'}")
return is_valid
def deploy(self, version_id: str, force: bool = False):
"""Deploy a model version."""
if not force:
# Guard only verifies validation was run; callers should also gate
# on the boolean returned by validate_before_deploy()
val_acc = self.model_registry.versions[version_id].metrics.get('val_accuracy')
if val_acc is None:
raise ValueError("Model not validated. Run validate_before_deploy() first.")
self.model_registry.activate_version(version_id)
self.deployment_history.append({
'version_id': version_id,
'deployed_at': datetime.now(),
'action': 'deploy'
})
print(f"Deployed version: {version_id}")
def predict(self, data: pd.DataFrame) -> np.ndarray:
"""Make predictions using active model."""
model = self.model_registry.get_active_model()
features_df = self.feature_pipeline.transform(data)
feature_cols = list(self.feature_pipeline.registry.features.keys())
X = features_df[feature_cols].dropna().values
return model.predict(X)
print("ModelDeploymentManager defined")
# Test deployment workflow
model_registry = ModelRegistry()
deployment_manager = ModelDeploymentManager(model_registry, pipeline)
# Train first version
v1_id = deployment_manager.train_new_version(
train_data=train_df,
model_class=RandomForestClassifier,
hyperparameters={'n_estimators': 50, 'max_depth': 3, 'random_state': 42}
)
# Validate
is_valid = deployment_manager.validate_before_deploy(v1_id, test_df)
# Deploy if valid
if is_valid:
deployment_manager.deploy(v1_id)
# Show version history
print("\nVersion History:")
print(model_registry.get_version_history())
Section 3: Model Monitoring
Continuous monitoring detects model degradation and data drift.
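Before the full `ModelMonitor`, it helps to see drift detection in isolation. One standard statistic not used in the class below is the population stability index (PSI); a minimal sketch (the 0.1 / 0.25 thresholds are common rules of thumb, not hard rules):

```python
import numpy as np

def population_stability_index(reference: np.ndarray,
                               current: np.ndarray,
                               bins: int = 10) -> float:
    # Bin edges from reference quantiles; open-ended outer bins so every
    # current value falls in some bin
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference) + 1e-6
    cur_pct = np.histogram(current, bins=edges)[0] / len(current) + 1e-6
    # PSI = sum over bins of (cur - ref) * ln(cur / ref)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(42)
reference = rng.normal(0, 1, 5000)
stable = rng.normal(0, 1, 5000)     # same distribution
shifted = rng.normal(1.0, 1, 5000)  # mean shifted by one std dev
print(population_stability_index(reference, stable))   # small (< 0.1)
print(population_stability_index(reference, shifted))  # large (> 0.25)
```

PSI compares whole distributions rather than single summary statistics, so it also catches variance and shape changes that a mean-based z-score can miss.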
# Model Monitoring System
class ModelMonitor:
"""Monitor model performance and data drift."""
def __init__(self, feature_pipeline: FeaturePipeline,
alert_threshold: float = 0.1):
self.feature_pipeline = feature_pipeline
self.alert_threshold = alert_threshold
self.prediction_log = []
self.performance_history = []
self.drift_alerts = []
def log_prediction(self, timestamp: datetime, features: Dict,
prediction: int, probability: float,
actual: Optional[int] = None):
"""Log a prediction for monitoring."""
self.prediction_log.append({
'timestamp': timestamp,
'features': features,
'prediction': prediction,
'probability': probability,
'actual': actual
})
def update_actual(self, timestamp: datetime, actual: int):
"""Update actual outcome for a prediction."""
for log in self.prediction_log:
if log['timestamp'] == timestamp:
log['actual'] = actual
break
def calculate_rolling_accuracy(self, window: int = 20) -> Optional[float]:
"""Calculate rolling accuracy."""
recent_logs = [l for l in self.prediction_log[-window:]
if l['actual'] is not None]
if not recent_logs:
return None
correct = sum(1 for l in recent_logs if l['prediction'] == l['actual'])
return correct / len(recent_logs)
def detect_feature_drift(self, current_data: pd.DataFrame) -> Dict:
"""Detect drift in feature distributions."""
drift_results = {}
validation = self.feature_pipeline.validate_features(current_data)
for feature_name, stats in validation.items():
drift_results[feature_name] = {
'z_score': stats['z_score'],
'is_drifted': stats['is_outlier']
}
if stats['is_outlier']:
self.drift_alerts.append({
'timestamp': datetime.now(),
'feature': feature_name,
'z_score': stats['z_score']
})
return drift_results
def detect_prediction_drift(self, window: int = 100) -> Dict:
"""Detect drift in prediction distribution."""
recent_logs = self.prediction_log[-window:]
if len(recent_logs) < window // 2:
return {'status': 'insufficient_data'}
# Calculate prediction distribution
predictions = [l['prediction'] for l in recent_logs]
probabilities = [l['probability'] for l in recent_logs]
# Split into first and second half
mid = len(predictions) // 2
first_half_mean = np.mean(predictions[:mid])
second_half_mean = np.mean(predictions[mid:])
drift_score = abs(second_half_mean - first_half_mean)
return {
'status': 'ok' if drift_score < self.alert_threshold else 'drift_detected',
'drift_score': drift_score,
'first_half_mean': first_half_mean,
'second_half_mean': second_half_mean
}
def check_performance_degradation(self, baseline_accuracy: float,
window: int = 50) -> Dict:
"""Check for performance degradation."""
current_accuracy = self.calculate_rolling_accuracy(window)
if current_accuracy is None:
return {'status': 'insufficient_data'}
degradation = baseline_accuracy - current_accuracy
return {
'status': 'ok' if degradation < self.alert_threshold else 'degraded',
'baseline_accuracy': baseline_accuracy,
'current_accuracy': current_accuracy,
'degradation': degradation
}
def generate_monitoring_report(self) -> Dict:
"""Generate comprehensive monitoring report."""
return {
'timestamp': datetime.now(),
'total_predictions': len(self.prediction_log),
'predictions_with_actual': sum(1 for l in self.prediction_log
if l['actual'] is not None),
'rolling_accuracy_20': self.calculate_rolling_accuracy(20),
'rolling_accuracy_50': self.calculate_rolling_accuracy(50),
'drift_alerts_count': len(self.drift_alerts),
'recent_drift_alerts': self.drift_alerts[-5:]
}
print("ModelMonitor class defined")
# Alert System
class AlertManager:
"""Manage alerts for model monitoring."""
def __init__(self):
self.alerts = []
self.alert_handlers = []
def register_handler(self, handler_func):
"""Register an alert handler function."""
self.alert_handlers.append(handler_func)
def raise_alert(self, alert_type: str, severity: str,
message: str, details: Dict = None):
"""Raise an alert."""
alert = {
'timestamp': datetime.now(),
'type': alert_type,
'severity': severity,
'message': message,
'details': details or {}
}
self.alerts.append(alert)
# Trigger handlers
for handler in self.alert_handlers:
handler(alert)
print(f"[{severity.upper()}] {alert_type}: {message}")
def get_alerts(self, severity: Optional[str] = None,
since: Optional[datetime] = None) -> List[Dict]:
"""Get alerts with optional filtering."""
filtered = self.alerts
if severity:
filtered = [a for a in filtered if a['severity'] == severity]
if since:
filtered = [a for a in filtered if a['timestamp'] >= since]
return filtered
# Example handler
def print_handler(alert):
if alert['severity'] == 'critical':
print(f"!!! CRITICAL ALERT: {alert['message']} !!!")
alert_manager = AlertManager()
alert_manager.register_handler(print_handler)
print("AlertManager configured")
# Simulate monitoring
monitor = ModelMonitor(pipeline)
# Simulate predictions
np.random.seed(42)
for i in range(100):
# Simulate prediction
prediction = np.random.choice([0, 1])
probability = 0.5 + np.random.uniform(-0.3, 0.3)
actual = np.random.choice([0, 1])
monitor.log_prediction(
timestamp=datetime.now() + timedelta(days=i),
features={'return_1d': np.random.normal(0, 0.02)},
prediction=prediction,
probability=probability,
actual=actual
)
# Generate report
report = monitor.generate_monitoring_report()
print("\nMonitoring Report:")
for key, value in report.items():
if key != 'recent_drift_alerts':
print(f" {key}: {value}")
Section 4: Error Handling and Fallbacks
Production systems need robust error handling and graceful degradation.
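The pattern the `PredictionService` below uses inline — try the real path, fall back to a safe default — can also be factored into a reusable decorator. A hedged sketch; `with_fallback` is illustrative, not a library API:

```python
import functools

def with_fallback(fallback_value, retries: int = 1):
    # Retry the wrapped call up to `retries` extra times, then degrade
    # gracefully by returning the fallback instead of raising
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for _ in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    continue
            return fallback_value
        return wrapper
    return decorator

calls = {'n': 0}

@with_fallback(fallback_value=0, retries=1)
def flaky_predict():
    calls['n'] += 1
    raise RuntimeError('model unavailable')

result = flaky_predict()
print(result, calls['n'])  # → 0 2
```

A decorator keeps the fallback policy in one place; the inline version in `PredictionService.predict()` trades that reuse for richer per-request diagnostics (warnings, error rates, alerts).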
# Production Prediction Service
class PredictionService:
"""Production-ready prediction service with fallbacks."""
def __init__(self, deployment_manager: ModelDeploymentManager,
monitor: ModelMonitor,
alert_manager: AlertManager):
self.deployment_manager = deployment_manager
self.monitor = monitor
self.alert_manager = alert_manager
self.fallback_prediction = 0 # Conservative: no position
self.request_count = 0
self.error_count = 0
def predict(self, data: pd.DataFrame) -> Dict:
"""Make prediction with error handling."""
self.request_count += 1
result = {
'timestamp': datetime.now(),
'status': 'success',
'prediction': None,
'probability': None,
'is_fallback': False,
'warnings': []
}
try:
# Validate input data
if len(data) < self.deployment_manager.feature_pipeline.registry.get_max_lookback():
result['warnings'].append('Insufficient data for full lookback')
# Check for data quality
if data['close'].isna().any():
raise ValueError("Missing price data")
# Detect feature drift
drift_results = self.monitor.detect_feature_drift(data)
drifted_features = [f for f, d in drift_results.items() if d['is_drifted']]
if drifted_features:
result['warnings'].append(f"Feature drift detected: {drifted_features}")
self.alert_manager.raise_alert(
'feature_drift', 'warning',
f"Drift detected in features: {drifted_features}"
)
# Make prediction
model = self.deployment_manager.model_registry.get_active_model()
features_df = self.deployment_manager.feature_pipeline.transform(data)
feature_cols = list(self.deployment_manager.feature_pipeline.registry.features.keys())
X = features_df[feature_cols].iloc[-1:].values
if np.isnan(X).any():
raise ValueError("NaN values in features")
prediction = model.predict(X)[0]
probability = model.predict_proba(X)[0, 1]
result['prediction'] = int(prediction)
result['probability'] = float(probability)
# Log prediction
self.monitor.log_prediction(
timestamp=result['timestamp'],
features=dict(zip(feature_cols, X[0])),
prediction=prediction,
probability=probability
)
except Exception as e:
self.error_count += 1
result['status'] = 'fallback'
result['prediction'] = self.fallback_prediction
result['probability'] = 0.5
result['is_fallback'] = True
result['error'] = str(e)
# Alert on errors
error_rate = self.error_count / self.request_count
if error_rate > 0.1:
self.alert_manager.raise_alert(
'high_error_rate', 'critical',
f"Error rate: {error_rate:.2%}",
{'error_count': self.error_count, 'request_count': self.request_count}
)
return result
def health_check(self) -> Dict:
"""Check service health."""
return {
'status': 'healthy' if self.error_count / max(1, self.request_count) < 0.1 else 'degraded',
'request_count': self.request_count,
'error_count': self.error_count,
'error_rate': self.error_count / max(1, self.request_count),
'active_model': self.deployment_manager.model_registry.active_version
}
print("PredictionService class defined")
# Test prediction service
service = PredictionService(deployment_manager, monitor, alert_manager)
# Normal prediction
result = service.predict(test_df)
print("\nPrediction Result:")
for key, value in result.items():
print(f" {key}: {value}")
# Health check
print("\nHealth Check:")
health = service.health_check()
for key, value in health.items():
print(f" {key}: {value}")
Section 5: Module Project - Complete Production System
Build a complete production ML trading system.
# Complete Production ML Trading System
class ProductionTradingSystem:
"""Complete production ML trading system."""
def __init__(self, initial_capital: float = 100000):
self.initial_capital = initial_capital
self.capital = initial_capital
self.position = 0
# Initialize components
self.feature_registry = FeatureRegistry()
self._setup_features()
self.feature_pipeline = FeaturePipeline(self.feature_registry)
self.model_registry = ModelRegistry()
self.deployment_manager = ModelDeploymentManager(
self.model_registry, self.feature_pipeline
)
self.monitor = ModelMonitor(self.feature_pipeline)
self.alert_manager = AlertManager()
self.prediction_service = PredictionService(
self.deployment_manager, self.monitor, self.alert_manager
)
# Trading state
self.trade_history = []
self.equity_curve = []
def _setup_features(self):
"""Setup standard feature set."""
features = [
FeatureDefinition('return_1d', 'return', 1, params={'period': 1}),
FeatureDefinition('return_5d', 'return', 5, params={'period': 5}),
FeatureDefinition('return_20d', 'return', 20, params={'period': 20}),
FeatureDefinition('volatility_20d', 'volatility', 20, params={'period': 20}),
FeatureDefinition('rsi', 'rsi', 14, params={'period': 14}),
FeatureDefinition('macd', 'macd', 26, params={'fast': 12, 'slow': 26}),
FeatureDefinition('price_to_sma_20', 'price_to_sma', 20, params={'period': 20}),
FeatureDefinition('price_to_sma_50', 'price_to_sma', 50, params={'period': 50}),
]
for feature in features:
self.feature_registry.register(feature)
def train(self, train_data: pd.DataFrame,
model_class=RandomForestClassifier,
hyperparameters: Dict = None):
"""Train and deploy a model."""
if hyperparameters is None:
hyperparameters = {
'n_estimators': 100,
'max_depth': 5,
'random_state': 42
}
# Train new version
version_id = self.deployment_manager.train_new_version(
train_data, model_class, hyperparameters
)
return version_id
def validate_and_deploy(self, version_id: str,
validation_data: pd.DataFrame,
min_accuracy: float = 0.5):
"""Validate and deploy a model version."""
is_valid = self.deployment_manager.validate_before_deploy(
version_id, validation_data, min_accuracy
)
if is_valid:
self.deployment_manager.deploy(version_id)
return True
else:
self.alert_manager.raise_alert(
'validation_failed', 'warning',
f"Model {version_id} failed validation"
)
return False
def process_bar(self, current_data: pd.DataFrame,
current_price: float) -> Dict:
"""Process a new bar and potentially trade."""
# Get prediction
prediction_result = self.prediction_service.predict(current_data)
# Determine position (a fallback prediction means "no position", so stay flat rather than going short)
if prediction_result['is_fallback']:
signal = 0
else:
signal = 1 if prediction_result['prediction'] == 1 else -1
trade_result = None
# Check for position change
if signal != self.position:
trade_result = self._execute_trade(signal, current_price)
# Update equity
self.equity_curve.append({
'timestamp': datetime.now(),
'capital': self.capital,
'position': self.position
})
return {
'prediction': prediction_result,
'signal': signal,
'trade': trade_result,
'capital': self.capital,
'position': self.position
}
def _execute_trade(self, new_position: int, price: float) -> Dict:
"""Execute a trade."""
# Calculate trade cost (0.1% commission + 0.05% slippage)
cost_rate = 0.0015
position_change = abs(new_position - self.position)
trade_cost = self.capital * position_change * cost_rate
self.capital -= trade_cost
self.position = new_position
trade = {
'timestamp': datetime.now(),
'price': price,
'new_position': new_position,
'cost': trade_cost
}
self.trade_history.append(trade)
return trade
def update_pnl(self, price_return: float):
"""Update P&L based on position and return."""
pnl = self.capital * self.position * price_return
self.capital += pnl
return pnl
def get_performance_summary(self) -> Dict:
"""Get performance summary."""
if not self.equity_curve:
return {'status': 'no_data'}
equity_df = pd.DataFrame(self.equity_curve)
capitals = equity_df['capital'].values
returns = np.diff(capitals) / capitals[:-1]
return {
'total_return': (self.capital / self.initial_capital) - 1,
'sharpe_ratio': np.sqrt(252) * np.mean(returns) / (np.std(returns) + 1e-8) if len(returns) > 0 else 0,
'max_drawdown': (capitals / np.maximum.accumulate(capitals) - 1).min() if len(capitals) > 0 else 0,
'n_trades': len(self.trade_history),
'total_costs': sum(t['cost'] for t in self.trade_history),
'active_model': self.model_registry.active_version,
'health': self.prediction_service.health_check()
}
def generate_system_report(self) -> str:
"""Generate comprehensive system report."""
perf = self.get_performance_summary()
monitoring = self.monitor.generate_monitoring_report()
report = f"""
========================================
PRODUCTION TRADING SYSTEM REPORT
========================================
Generated: {datetime.now()}
--- Performance ---
Total Return: {perf.get('total_return', 0):.2%}
Sharpe Ratio: {perf.get('sharpe_ratio', 0):.2f}
Max Drawdown: {perf.get('max_drawdown', 0):.2%}
Number of Trades: {perf.get('n_trades', 0)}
Total Costs: ${perf.get('total_costs', 0):,.2f}
--- Model ---
Active Model: {perf.get('active_model', 'None')}
Feature Hash: {self.feature_registry.get_feature_hash()}
--- Monitoring ---
Total Predictions: {monitoring.get('total_predictions', 0)}
Rolling Accuracy (20): {monitoring.get('rolling_accuracy_20', 'N/A')}
Drift Alerts: {monitoring.get('drift_alerts_count', 0)}
--- Health ---
Status: {perf.get('health', {}).get('status', 'Unknown')}
Error Rate: {perf.get('health', {}).get('error_rate', 0):.2%}
========================================
"""
return report
print("ProductionTradingSystem class defined")
# Run complete production system
# Generate more data
full_data = generate_sample_data(1500)
# Split data
train_data = full_data.iloc[:1000]
val_data = full_data.iloc[1000:1200]
test_data = full_data.iloc[1200:]
# Initialize system
system = ProductionTradingSystem(initial_capital=100000)
# Train model
print("Training model...")
version_id = system.train(train_data)
# Validate and deploy
print("\nValidating and deploying...")
deployed = system.validate_and_deploy(version_id, val_data, min_accuracy=0.45)
# Simulate live trading
print("\nSimulating live trading...")
lookback = 100 # Days of history needed
for i in range(lookback, len(test_data)):
# Get current data window
current_data = test_data.iloc[i-lookback:i+1]
current_price = test_data.iloc[i]['close']
# Process bar
result = system.process_bar(current_data, current_price)
# Update P&L if we have previous price
if i > lookback:
prev_price = test_data.iloc[i-1]['close']
price_return = (current_price - prev_price) / prev_price
system.update_pnl(price_return)
# Generate final report
print(system.generate_system_report())
# Visualize system performance
if system.equity_curve:
equity_df = pd.DataFrame(system.equity_curve)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Equity curve
axes[0, 0].plot(range(len(equity_df)), equity_df['capital'])
axes[0, 0].set_xlabel('Time Step')
axes[0, 0].set_ylabel('Capital ($)')
axes[0, 0].set_title('Equity Curve')
# Position over time
axes[0, 1].step(range(len(equity_df)), equity_df['position'], where='post')
axes[0, 1].set_xlabel('Time Step')
axes[0, 1].set_ylabel('Position')
axes[0, 1].set_title('Position Over Time')
# Trade costs
if system.trade_history:
costs = [t['cost'] for t in system.trade_history]
axes[1, 0].bar(range(len(costs)), costs)
axes[1, 0].set_xlabel('Trade Number')
axes[1, 0].set_ylabel('Cost ($)')
axes[1, 0].set_title('Trade Costs')
# Drawdown
capitals = equity_df['capital'].values
peak = np.maximum.accumulate(capitals)
drawdown = (capitals - peak) / peak
axes[1, 1].fill_between(range(len(drawdown)), drawdown, 0, alpha=0.7, color='red')
axes[1, 1].set_xlabel('Time Step')
axes[1, 1].set_ylabel('Drawdown')
axes[1, 1].set_title('Drawdown')
plt.tight_layout()
plt.show()
Exercises
Complete the following exercises to practice production ML systems.
Exercise 13.1: Create Feature Definition (Guided)
Define a new feature for the pipeline.
Solution 13.1
def create_bollinger_band_feature():
feature = FeatureDefinition(
name='bb_width',
feature_type='volatility',
lookback_periods=20,
params={'period': 20, 'std_dev': 2}
)
return feature
Exercise 13.2: Implement Model Version Comparison (Guided)
Create a function to compare model versions.
Solution 13.2
def compare_model_versions(registry: ModelRegistry,
version_ids: List[str]) -> pd.DataFrame:
records = []
for version_id in version_ids:
# Get version from registry
if version_id in registry.versions:
version = registry.versions[version_id]
# Create record dict
record = {
'version_id': version.version_id,
'model_type': version.model_type,
'created_at': version.created_at,
'is_active': version.is_active
}
# Add all metrics
for metric_name, metric_value in version.metrics.items():
record[metric_name] = metric_value
records.append(record)
return pd.DataFrame(records)
Exercise 13.3: Implement Drift Detection (Guided)
Create a simple drift detection function.
Solution 13.3
def detect_distribution_drift(reference_data: np.ndarray,
current_data: np.ndarray,
threshold: float = 0.1) -> Dict:
# Calculate reference statistics
ref_mean = np.mean(reference_data)
ref_std = np.std(reference_data)
# Calculate current statistics
curr_mean = np.mean(current_data)
curr_std = np.std(current_data)
# Calculate drift metrics
mean_drift = abs(curr_mean - ref_mean) / (ref_std + 1e-10)
std_drift = abs(curr_std - ref_std) / (ref_std + 1e-10)
# Determine if drifted
is_drifted = mean_drift > threshold or std_drift > threshold
return {
'mean_drift': mean_drift,
'std_drift': std_drift,
'is_drifted': is_drifted,
'ref_mean': ref_mean,
'curr_mean': curr_mean
}
Exercise 13.4: Build Feature Store (Open-ended)
Create a simple feature store for caching computed features.
Solution 13.4
class FeatureStore:
def __init__(self, ttl_hours: int = 24):
self.cache = {} # {date: {feature_name: value}}
self.timestamps = {} # {date: cache_time}
self.ttl = timedelta(hours=ttl_hours)
def put(self, date: datetime, features: Dict[str, float]):
"""Store features for a date."""
date_key = date.date() if isinstance(date, datetime) else date
self.cache[date_key] = features
self.timestamps[date_key] = datetime.now()
def get(self, date: datetime, feature_names: List[str] = None) -> Optional[Dict]:
"""Get features for a date (point-in-time lookup)."""
date_key = date.date() if isinstance(date, datetime) else date
if date_key not in self.cache:
return None
# Check TTL
if datetime.now() - self.timestamps[date_key] > self.ttl:
del self.cache[date_key]
del self.timestamps[date_key]
return None
features = self.cache[date_key]
if feature_names:
return {k: v for k, v in features.items() if k in feature_names}
return features
def get_range(self, start_date: datetime, end_date: datetime) -> pd.DataFrame:
"""Get features for a date range."""
records = []
current = start_date
while current <= end_date:
features = self.get(current)
if features:
records.append({'date': current, **features})
current += timedelta(days=1)
return pd.DataFrame(records)
def save(self, filepath: str):
"""Save feature store to disk."""
with open(filepath, 'wb') as f:
pickle.dump({'cache': self.cache, 'timestamps': self.timestamps}, f)
@classmethod
def load(cls, filepath: str) -> 'FeatureStore':
"""Load feature store from disk."""
with open(filepath, 'rb') as f:
data = pickle.load(f)
store = cls()
store.cache = data['cache']
store.timestamps = data['timestamps']
return store
def cleanup_expired(self):
"""Remove expired entries."""
now = datetime.now()
expired = [k for k, v in self.timestamps.items() if now - v > self.ttl]
for k in expired:
del self.cache[k]
del self.timestamps[k]
return len(expired)
# Usage
store = FeatureStore(ttl_hours=24)
store.put(datetime.now(), {'return_1d': 0.01, 'rsi': 55})
print(store.get(datetime.now()))
Exercise 13.5: Implement A/B Testing Framework (Open-ended)
Create a framework for A/B testing model versions.
Solution 13.5
class ABTestManager:
def __init__(self, model_a, model_b, split_ratio=0.5):
self.model_a = model_a
self.model_b = model_b
self.split_ratio = split_ratio
self.results_a = []
self.results_b = []
def predict(self, X):
"""Route prediction to A or B based on split."""
if np.random.random() < self.split_ratio:
return 'A', self.model_a.predict(X)
else:
return 'B', self.model_b.predict(X)
def record_outcome(self, model_id: str, prediction: int, actual: int):
"""Record prediction outcome."""
is_correct = prediction == actual
if model_id == 'A':
self.results_a.append(is_correct)
else:
self.results_b.append(is_correct)
def get_performance(self) -> Dict:
"""Get performance comparison."""
acc_a = np.mean(self.results_a) if self.results_a else 0
acc_b = np.mean(self.results_b) if self.results_b else 0
return {
'model_a': {'accuracy': acc_a, 'n_samples': len(self.results_a)},
'model_b': {'accuracy': acc_b, 'n_samples': len(self.results_b)}
}
def is_significant(self, confidence=0.95) -> Dict:
"""Check if difference is statistically significant."""
if len(self.results_a) < 30 or len(self.results_b) < 30:
return {'significant': False, 'reason': 'insufficient_samples'}
# Two-proportion z-test
p_a = np.mean(self.results_a)
p_b = np.mean(self.results_b)
n_a = len(self.results_a)
n_b = len(self.results_b)
p_pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))
z = (p_a - p_b) / (se + 1e-10)
# Two-sided critical value for 95% confidence; note the `confidence`
# argument is not used here — 1.96 corresponds to the 0.95 default
z_critical = 1.96
return {
'significant': abs(z) > z_critical,
'z_score': z,
'winner': 'A' if z > z_critical else ('B' if z < -z_critical else 'tie')
}
def get_recommendation(self) -> str:
"""Get deployment recommendation."""
perf = self.get_performance()
sig = self.is_significant()
if not sig['significant']:
return "Continue testing - no significant difference yet"
winner = sig['winner']
return f"Deploy Model {winner} - statistically significant improvement"
# Usage
# ab_test = ABTestManager(model_v1, model_v2)
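One refinement: `is_significant` above hardcodes 1.96 and ignores its `confidence` argument. The critical value can be derived for any confidence level with the standard library's normal quantile function, no SciPy required:

```python
from statistics import NormalDist

def z_critical(confidence: float) -> float:
    # Two-sided critical value: standard normal quantile at 1 - alpha/2
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

print(round(z_critical(0.95), 2))   # → 1.96
print(round(z_critical(0.99), 3))   # → 2.576
```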
Exercise 13.6: Create Automated Retraining Pipeline (Open-ended)
Build an automated pipeline that retrains models when performance degrades.
Solution 13.6
class AutoRetrainer:
def __init__(self, deployment_manager: ModelDeploymentManager,
monitor: ModelMonitor,
accuracy_threshold: float = 0.48,
retrain_window: int = 252):
self.deployment_manager = deployment_manager
self.monitor = monitor
self.accuracy_threshold = accuracy_threshold
self.retrain_window = retrain_window
self.retrain_history = []
self.last_retrain = None
self.min_retrain_interval = timedelta(days=7)
def check_retrain_needed(self) -> bool:
"""Check if retraining is needed."""
current_accuracy = self.monitor.calculate_rolling_accuracy(50)
if current_accuracy is None:
return False
# Check if enough time since last retrain
if self.last_retrain:
if datetime.now() - self.last_retrain < self.min_retrain_interval:
return False
return current_accuracy < self.accuracy_threshold
def retrain(self, recent_data: pd.DataFrame,
model_class=RandomForestClassifier,
hyperparameters: Dict = None) -> Optional[str]:
"""Retrain model on recent data."""
if not self.check_retrain_needed():
return None
if hyperparameters is None:
hyperparameters = {'n_estimators': 100, 'max_depth': 5, 'random_state': 42}
# Use recent data for training
train_data = recent_data.iloc[-self.retrain_window:]
# Train new version
version_id = self.deployment_manager.train_new_version(
train_data, model_class, hyperparameters
)
# Log retraining
self.retrain_history.append({
'timestamp': datetime.now(),
'version_id': version_id,
'trigger_accuracy': self.monitor.calculate_rolling_accuracy(50)
})
self.last_retrain = datetime.now()
return version_id
def auto_deploy(self, version_id: str,
validation_data: pd.DataFrame,
min_improvement: float = 0.02) -> bool:
"""Automatically deploy if new model is better."""
# Validate new model
is_valid = self.deployment_manager.validate_before_deploy(
version_id, validation_data
)
if not is_valid:
return False
# Check if improvement is significant
current_accuracy = self.monitor.calculate_rolling_accuracy(50) or 0
new_accuracy = self.deployment_manager.model_registry.versions[version_id].metrics.get('val_accuracy', 0)
if new_accuracy - current_accuracy >= min_improvement:
self.deployment_manager.deploy(version_id)
return True
return False
def run_auto_retrain_cycle(self, recent_data: pd.DataFrame,
validation_data: pd.DataFrame) -> Dict:
"""Run complete auto-retrain cycle."""
result = {
'retrain_needed': self.check_retrain_needed(),
'retrained': False,
'deployed': False
}
if result['retrain_needed']:
version_id = self.retrain(recent_data)
if version_id:
result['retrained'] = True
result['version_id'] = version_id
result['deployed'] = self.auto_deploy(version_id, validation_data)
return result
# Usage
# auto_retrainer = AutoRetrainer(deployment_manager, monitor)
# result = auto_retrainer.run_auto_retrain_cycle(recent_data, val_data)
Summary
In this module, you learned:
- Feature Pipelines: Building robust, versioned feature computation systems
- Model Versioning: Managing model versions with proper metadata and rollback capability
- Model Monitoring: Detecting drift, degradation, and anomalies in production
- Error Handling: Building resilient systems with fallbacks and alerts
- Production Systems: Integrating all components into a complete trading system
Key Takeaways
- Feature pipelines must be consistent between training and inference
- Model versioning enables reproducibility and safe rollbacks
- Continuous monitoring catches problems before they cause losses
- Fallback mechanisms ensure the system degrades gracefully
- Production systems require more engineering than research systems
Next Steps
In Module 14, you'll explore advanced ML topics including reinforcement learning, online learning, and ensemble methods for finance.
Module 14: Advanced ML Topics
Overview
This module explores cutting-edge ML techniques for finance, including reinforcement learning, online learning, and advanced ensemble methods. These techniques address challenges that are unique to financial markets.
Learning Objectives
By the end of this module, you will be able to:
- Apply reinforcement learning to portfolio optimization
- Implement online learning for adapting to market changes
- Build advanced ensemble methods for improved predictions
- Understand meta-learning approaches for finance
Prerequisites
- Module 11: Deep Learning for Finance
- Module 12: Backtesting ML Strategies
- Module 13: Production ML Systems
Estimated Time: 4 hours
Section 1: Reinforcement Learning for Trading
Reinforcement learning (RL) frames trading as a sequential decision problem where an agent learns to maximize cumulative rewards.
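Concretely, the agent seeks a policy that maximizes the expected discounted return G = Σ_t γ^t r_t. Before building the full environment, here is a quick standalone illustration of discounting a reward stream (the reward values are arbitrary; γ matches the agent default used later in this section):

```python
import numpy as np

# Hypothetical per-step rewards; gamma is the discount factor.
gamma = 0.95
rewards = np.array([0.01, -0.005, 0.02, 0.0, 0.015])

# Discounted return G = sum_t gamma^t * r_t
discounted_return = float(np.sum(gamma ** np.arange(len(rewards)) * rewards))
print(round(discounted_return, 6))  # 0.035518
```

Because γ < 1, rewards far in the future count for less, which pushes the agent toward strategies that pay off sooner.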
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from typing import Dict, List, Tuple, Any, Optional
from dataclasses import dataclass
from collections import deque
import random
from abc import ABC, abstractmethod
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
print("Advanced ML libraries loaded")
# Trading Environment for RL
class TradingEnvironment:
"""RL environment for trading."""
def __init__(self, df: pd.DataFrame, initial_capital: float = 100000,
commission: float = 0.001, window_size: int = 20):
self.df = df.reset_index(drop=True)
self.initial_capital = initial_capital
self.commission = commission
self.window_size = window_size
# State variables
self.current_step = None
self.capital = None
self.position = None # -1, 0, 1
self.entry_price = None
# Actions: 0=hold, 1=buy, 2=sell
self.action_space = 3
def reset(self) -> np.ndarray:
"""Reset environment to initial state."""
self.current_step = self.window_size
self.capital = self.initial_capital
self.position = 0
self.entry_price = None
return self._get_state()
def _get_state(self) -> np.ndarray:
"""Get current state observation."""
# Price-based features
window = self.df.iloc[self.current_step - self.window_size:self.current_step]
# Normalized returns
returns = window['close'].pct_change().fillna(0).values
# Position encoding
position_encoding = np.array([self.position])
# PnL if in position
if self.position != 0 and self.entry_price:
unrealized_pnl = (self.df.iloc[self.current_step]['close'] - self.entry_price) / self.entry_price
unrealized_pnl = np.array([unrealized_pnl * self.position])
else:
unrealized_pnl = np.array([0.0])
state = np.concatenate([returns, position_encoding, unrealized_pnl])
return state.astype(np.float32)
def step(self, action: int) -> Tuple[np.ndarray, float, bool, Dict]:
"""Take action and return next state, reward, done, info."""
current_price = self.df.iloc[self.current_step]['close']
reward = 0
# Execute action
if action == 1: # Buy
if self.position <= 0:
# Close short if any
if self.position == -1:
pnl = (self.entry_price - current_price) / self.entry_price
self.capital *= (1 + pnl - self.commission)
# Open long
self.position = 1
self.entry_price = current_price
self.capital *= (1 - self.commission)
elif action == 2: # Sell
if self.position >= 0:
# Close long if any
if self.position == 1:
pnl = (current_price - self.entry_price) / self.entry_price
self.capital *= (1 + pnl - self.commission)
# Open short
self.position = -1
self.entry_price = current_price
self.capital *= (1 - self.commission)
# Move to next step
self.current_step += 1
# Calculate reward (daily return of position)
if self.current_step < len(self.df):
next_price = self.df.iloc[self.current_step]['close']
price_return = (next_price - current_price) / current_price
reward = self.position * price_return
# Check if done
done = self.current_step >= len(self.df) - 1
# Get next state
next_state = self._get_state() if not done else np.zeros(self.window_size + 2)
info = {
'capital': self.capital,
'position': self.position,
'step': self.current_step
}
return next_state, reward, done, info
@property
def state_size(self) -> int:
"""Get state dimension."""
return self.window_size + 2 # returns + position + unrealized pnl
print("TradingEnvironment class defined")
# Q-Learning Agent
class QLearningAgent:
"""Simple Q-learning agent for trading."""
def __init__(self, state_size: int, action_size: int,
learning_rate: float = 0.01,
gamma: float = 0.95,
epsilon: float = 1.0,
epsilon_decay: float = 0.995,
epsilon_min: float = 0.01):
self.state_size = state_size
self.action_size = action_size
self.learning_rate = learning_rate
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
# Q-table (discretized)
self.n_bins = 10
self.q_table = {}
def _discretize_state(self, state: np.ndarray) -> tuple:
"""Discretize continuous state."""
# Clip and bin the state
clipped = np.clip(state, -1, 1)
binned = np.digitize(clipped, np.linspace(-1, 1, self.n_bins))
return tuple(binned)
def get_q_values(self, state: np.ndarray) -> np.ndarray:
"""Get Q-values for a state."""
discrete_state = self._discretize_state(state)
if discrete_state not in self.q_table:
self.q_table[discrete_state] = np.zeros(self.action_size)
return self.q_table[discrete_state]
def choose_action(self, state: np.ndarray, training: bool = True) -> int:
"""Choose action using epsilon-greedy policy."""
if training and np.random.random() < self.epsilon:
return np.random.randint(self.action_size)
q_values = self.get_q_values(state)
return np.argmax(q_values)
def learn(self, state: np.ndarray, action: int,
reward: float, next_state: np.ndarray, done: bool):
"""Update Q-values."""
discrete_state = self._discretize_state(state)
discrete_next_state = self._discretize_state(next_state)
# Initialize if needed
if discrete_state not in self.q_table:
self.q_table[discrete_state] = np.zeros(self.action_size)
if discrete_next_state not in self.q_table:
self.q_table[discrete_next_state] = np.zeros(self.action_size)
# Q-learning update
current_q = self.q_table[discrete_state][action]
if done:
target_q = reward
else:
target_q = reward + self.gamma * np.max(self.q_table[discrete_next_state])
self.q_table[discrete_state][action] += self.learning_rate * (target_q - current_q)
def decay_epsilon(self):
"""Decay exploration rate."""
self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
print("QLearningAgent class defined")
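To see what `QLearningAgent.learn` does on a single transition, here is the tabular update Q(s, a) ← Q(s, a) + lr · (r + γ · max_a′ Q(s′, a′) − Q(s, a)) applied by hand, with made-up states and Q-values:

```python
import numpy as np

# One tabular Q-update on a toy transition (states and values are arbitrary).
lr, gamma = 0.1, 0.95
q_table = {
    ('s0',): np.array([0.0, 0.5, 0.2]),   # Q-values for hold/buy/sell in s0
    ('s1',): np.array([0.1, 0.0, 0.3]),
}
state, action, reward, next_state = ('s0',), 1, 1.0, ('s1',)

# Bellman target: immediate reward plus discounted best next-state value
target = reward + gamma * q_table[next_state].max()      # 1.0 + 0.95 * 0.3
q_table[state][action] += lr * (target - q_table[state][action])
print(round(q_table[state][action], 4))  # 0.5785
```

The Q-value for (s0, buy) moves a fraction `lr` of the way toward the target, which is how repeated experience gradually propagates reward information backward through the table.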
# Generate data and train RL agent
def generate_trading_data(n_samples=1000):
np.random.seed(42)
returns = np.random.normal(0.0003, 0.015, n_samples)
prices = 100 * np.exp(np.cumsum(returns))
return pd.DataFrame({
'close': prices,
'volume': np.random.lognormal(15, 0.5, n_samples)
})
df = generate_trading_data(1000)
# Create environment and agent
env = TradingEnvironment(df, window_size=10)
agent = QLearningAgent(
state_size=env.state_size,
action_size=env.action_space
)
# Training loop
n_episodes = 100
episode_rewards = []
print("Training RL agent...")
for episode in range(n_episodes):
state = env.reset()
total_reward = 0
done = False
while not done:
action = agent.choose_action(state, training=True)
next_state, reward, done, info = env.step(action)
agent.learn(state, action, reward, next_state, done)
state = next_state
total_reward += reward
agent.decay_epsilon()
episode_rewards.append(total_reward)
if (episode + 1) % 20 == 0:
avg_reward = np.mean(episode_rewards[-20:])
print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.4f}, Epsilon = {agent.epsilon:.4f}")
# Evaluate trained agent
state = env.reset()
done = False
capitals = [env.initial_capital]
positions = [0]
while not done:
action = agent.choose_action(state, training=False)
next_state, reward, done, info = env.step(action)
capitals.append(info['capital'])
positions.append(info['position'])
state = next_state
# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Training rewards
axes[0, 0].plot(episode_rewards)
axes[0, 0].set_xlabel('Episode')
axes[0, 0].set_ylabel('Total Reward')
axes[0, 0].set_title('Training Progress')
# Equity curve
axes[0, 1].plot(capitals)
axes[0, 1].set_xlabel('Step')
axes[0, 1].set_ylabel('Capital ($)')
axes[0, 1].set_title('RL Agent Equity Curve')
# Position over time
axes[1, 0].step(range(len(positions)), positions, where='post')
axes[1, 0].set_xlabel('Step')
axes[1, 0].set_ylabel('Position')
axes[1, 0].set_title('Positions Over Time')
# Compare with buy and hold (aligned to the agent's start at window_size)
buy_hold = env.initial_capital * (df['close'] / df['close'].iloc[env.window_size]).iloc[env.window_size:]
axes[1, 1].plot(range(len(capitals)), capitals, label='RL Agent')
axes[1, 1].plot(range(len(buy_hold)), buy_hold.values, label='Buy & Hold', alpha=0.7)
axes[1, 1].set_xlabel('Step')
axes[1, 1].set_ylabel('Capital ($)')
axes[1, 1].set_title('RL vs Buy & Hold')
axes[1, 1].legend()
plt.tight_layout()
plt.show()
print(f"\nFinal Capital: ${capitals[-1]:,.2f}")
print(f"Total Return: {(capitals[-1] / capitals[0] - 1):.2%}")
Section 2: Online Learning
Online learning allows models to adapt continuously to new data without full retraining.
# Online Learning Framework
class OnlineLearner(ABC):
"""Abstract base class for online learning."""
@abstractmethod
def partial_fit(self, X: np.ndarray, y: np.ndarray):
pass
@abstractmethod
def predict(self, X: np.ndarray) -> np.ndarray:
pass
class OnlineSGDClassifier(OnlineLearner):
"""Online SGD classifier with adaptive learning."""
def __init__(self, n_features: int, learning_rate: float = 0.01,
l2_reg: float = 0.001):
self.n_features = n_features
self.learning_rate = learning_rate
self.l2_reg = l2_reg
# Initialize weights
self.weights = np.zeros(n_features)
self.bias = 0.0
# Adaptive learning rate (AdaGrad)
self.grad_squared = np.zeros(n_features)
self.bias_grad_squared = 0.0
self.n_samples_seen = 0
def _sigmoid(self, z: np.ndarray) -> np.ndarray:
"""Sigmoid activation."""
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def partial_fit(self, X: np.ndarray, y: np.ndarray):
"""Update model with new samples."""
X = np.atleast_2d(X)
y = np.atleast_1d(y)
for xi, yi in zip(X, y):
# Forward pass
z = np.dot(xi, self.weights) + self.bias
pred = self._sigmoid(z)
# Gradient
error = pred - yi
grad_w = error * xi + self.l2_reg * self.weights
grad_b = error
# AdaGrad update
self.grad_squared += grad_w ** 2
self.bias_grad_squared += grad_b ** 2
adj_lr_w = self.learning_rate / (np.sqrt(self.grad_squared) + 1e-8)
adj_lr_b = self.learning_rate / (np.sqrt(self.bias_grad_squared) + 1e-8)
# Update weights
self.weights -= adj_lr_w * grad_w
self.bias -= adj_lr_b * grad_b
self.n_samples_seen += 1
def predict_proba(self, X: np.ndarray) -> np.ndarray:
"""Predict probabilities."""
X = np.atleast_2d(X)
z = np.dot(X, self.weights) + self.bias
return self._sigmoid(z)
def predict(self, X: np.ndarray) -> np.ndarray:
"""Predict class labels."""
return (self.predict_proba(X) >= 0.5).astype(int)
print("OnlineSGDClassifier defined")
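The AdaGrad denominator in `partial_fit` is what makes the learning rate adaptive: each weight's effective step size shrinks as its squared gradients accumulate. A toy trace with four unit gradients makes the schedule visible:

```python
import numpy as np

# effective_lr = lr / (sqrt(sum of squared gradients) + eps)
lr, eps = 0.1, 1e-8
grad_squared = 0.0
effective_lrs = []
for g in [1.0, 1.0, 1.0, 1.0]:           # four unit gradients in a row
    grad_squared += g ** 2
    effective_lrs.append(lr / (np.sqrt(grad_squared) + eps))
print([round(e, 4) for e in effective_lrs])  # [0.1, 0.0707, 0.0577, 0.05]
```

Frequently updated weights thus settle down over time, while rarely active features keep a comparatively large step size.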
# Online Learning with Concept Drift Detection
class DriftDetector:
"""Detect concept drift in streaming data."""
def __init__(self, window_size: int = 100, threshold: float = 0.1):
self.window_size = window_size
self.threshold = threshold
self.error_window = deque(maxlen=window_size)
self.baseline_error = None
def update(self, is_error: bool):
"""Update with new prediction result."""
self.error_window.append(1 if is_error else 0)
if len(self.error_window) == self.window_size:
if self.baseline_error is None:
self.baseline_error = np.mean(self.error_window)
def is_drift(self) -> bool:
"""Check if drift has occurred."""
if self.baseline_error is None or len(self.error_window) < self.window_size:
return False
current_error = np.mean(self.error_window)
return (current_error - self.baseline_error) > self.threshold
def reset_baseline(self):
"""Reset baseline after handling drift."""
if len(self.error_window) >= self.window_size:
self.baseline_error = np.mean(self.error_window)
class AdaptiveOnlineLearner:
"""Online learner with drift detection and adaptation."""
def __init__(self, n_features: int, learning_rate: float = 0.01):
self.model = OnlineSGDClassifier(n_features, learning_rate)
self.drift_detector = DriftDetector(window_size=50, threshold=0.15)
self.predictions = []
self.actuals = []
self.drift_points = []
def predict_and_update(self, X: np.ndarray, y: int) -> int:
"""Make prediction and update model."""
# Predict
pred = self.model.predict(X.reshape(1, -1))[0]
# Record
self.predictions.append(pred)
self.actuals.append(y)
# Check for drift
is_error = pred != y
self.drift_detector.update(is_error)
if self.drift_detector.is_drift():
self.drift_points.append(len(self.predictions))
# Increase learning rate temporarily
self.model.learning_rate *= 2
self.drift_detector.reset_baseline()
# Update model
self.model.partial_fit(X.reshape(1, -1), np.array([y]))
# Decay learning rate
self.model.learning_rate = max(0.001, self.model.learning_rate * 0.999)
return pred
def get_rolling_accuracy(self, window: int = 50) -> float:
"""Get rolling accuracy."""
if len(self.predictions) < window:
return None
recent_pred = self.predictions[-window:]
recent_actual = self.actuals[-window:]
return np.mean(np.array(recent_pred) == np.array(recent_actual))
print("AdaptiveOnlineLearner defined")
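The core of `DriftDetector` is just a windowed error rate compared against a baseline. A standalone sketch with a simulated error stream (the jump at step 200 is artificial) shows the mechanism in isolation:

```python
import numpy as np
from collections import deque

# Flag drift once the rolling error rate exceeds the baseline by `threshold`.
window_size, threshold = 50, 0.15
errors = deque(maxlen=window_size)
baseline, drift_at = None, None

rng = np.random.default_rng(0)
for t in range(400):
    p_error = 0.1 if t < 200 else 0.5    # simulated error rate jumps at t=200
    errors.append(int(rng.random() < p_error))
    if len(errors) == window_size and baseline is None:
        baseline = np.mean(errors)       # freeze baseline on first full window
    elif baseline is not None and np.mean(errors) - baseline > threshold:
        drift_at = t
        break

print(f"Drift flagged at step {drift_at}")
```

Note the lag: the rolling window has to absorb enough post-change errors before the threshold trips, so detection always trails the true change point.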
# Test online learning with simulated concept drift
def generate_drift_data(n_samples=1000, drift_point=500):
"""Generate data with concept drift."""
np.random.seed(42)
X = np.random.randn(n_samples, 5)
# Before drift: y depends on X[:, 0] + X[:, 1]
y_before = ((X[:drift_point, 0] + X[:drift_point, 1]) > 0).astype(int)
# After drift: y depends on X[:, 2] - X[:, 3]
y_after = ((X[drift_point:, 2] - X[drift_point:, 3]) > 0).astype(int)
y = np.concatenate([y_before, y_after])
return X, y
X, y = generate_drift_data(1000, drift_point=500)
# Train adaptive online learner
online_learner = AdaptiveOnlineLearner(n_features=5, learning_rate=0.1)
rolling_accuracies = []
print("Training adaptive online learner...")
for i in range(len(X)):
online_learner.predict_and_update(X[i], y[i])
if i >= 50:
acc = online_learner.get_rolling_accuracy(50)
rolling_accuracies.append(acc)
print(f"\nDrift detected at points: {online_learner.drift_points}")
print(f"Final rolling accuracy: {rolling_accuracies[-1]:.4f}")
# Visualize online learning
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Rolling accuracy
axes[0].plot(range(50, len(X)), rolling_accuracies)
axes[0].axvline(x=500, color='red', linestyle='--', label='True Drift Point')
for dp in online_learner.drift_points:
axes[0].axvline(x=dp, color='green', linestyle=':', alpha=0.7)
axes[0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Sample')
axes[0].set_ylabel('Rolling Accuracy (50)')
axes[0].set_title('Online Learning Accuracy with Concept Drift')
axes[0].legend()
# Cumulative accuracy
cumulative_correct = np.cumsum(np.array(online_learner.predictions) == np.array(online_learner.actuals))
cumulative_accuracy = cumulative_correct / np.arange(1, len(cumulative_correct) + 1)
axes[1].plot(cumulative_accuracy)
axes[1].axvline(x=500, color='red', linestyle='--', label='Drift Point')
axes[1].set_xlabel('Sample')
axes[1].set_ylabel('Cumulative Accuracy')
axes[1].set_title('Cumulative Accuracy Over Time')
axes[1].legend()
plt.tight_layout()
plt.show()
Section 3: Advanced Ensemble Methods

Advanced ensembles go beyond simple averaging to dynamically weight models based on recent performance.
# Dynamic Weighted Ensemble
class DynamicEnsemble:
"""Ensemble that dynamically adjusts weights based on performance."""
def __init__(self, models: List, weight_decay: float = 0.95,
min_weight: float = 0.1):
self.models = models
self.n_models = len(models)
self.weight_decay = weight_decay
self.min_weight = min_weight
# Initialize equal weights
self.weights = np.ones(self.n_models) / self.n_models
# Track performance
self.model_correct = np.zeros(self.n_models)
self.model_total = np.zeros(self.n_models)
def predict(self, X: np.ndarray) -> np.ndarray:
"""Make weighted ensemble prediction."""
predictions = np.array([m.predict(X) for m in self.models])
# Weighted voting
weighted_sum = np.dot(self.weights, predictions)
return (weighted_sum >= 0.5).astype(int)
def predict_proba(self, X: np.ndarray) -> np.ndarray:
"""Get weighted probability."""
probas = np.array([m.predict_proba(X)[:, 1] if hasattr(m, 'predict_proba')
else m.predict(X) for m in self.models])
return np.dot(self.weights, probas)
def update_weights(self, X: np.ndarray, y_true: int):
"""Update weights based on individual model performance."""
# Get individual predictions
predictions = np.array([m.predict(X.reshape(1, -1))[0] for m in self.models])
# Update performance tracking
correct = predictions == y_true
self.model_correct = self.weight_decay * self.model_correct + correct
self.model_total = self.weight_decay * self.model_total + 1
# Calculate new weights based on accuracy
accuracies = self.model_correct / (self.model_total + 1e-8)
# Ensure minimum weight
accuracies = np.maximum(accuracies, self.min_weight)
# Normalize weights
self.weights = accuracies / accuracies.sum()
def get_model_weights(self) -> Dict[int, float]:
"""Get current model weights."""
return {i: w for i, w in enumerate(self.weights)}
print("DynamicEnsemble defined")
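The weighting rule in `update_weights` decays both the correct count and the total count, so recent performance dominates. A toy run with two simulated models (accuracy ~90% vs ~40%) shows the weights separating:

```python
import numpy as np

# Decayed-accuracy weighting, as in DynamicEnsemble.update_weights.
weight_decay, min_weight = 0.95, 0.1
correct, total = np.zeros(2), np.zeros(2)

rng = np.random.default_rng(1)
for _ in range(200):
    # model 0 is right ~90% of the time, model 1 only ~40%
    hits = np.array([rng.random() < 0.9, rng.random() < 0.4], dtype=float)
    correct = weight_decay * correct + hits
    total = weight_decay * total + 1

accuracies = np.maximum(correct / (total + 1e-8), min_weight)
weights = accuracies / accuracies.sum()
print(weights)  # model 0 carries the larger weight
```

The decay gives the ensemble an effective memory of roughly 1/(1 − decay) recent samples, so a model that deteriorates loses weight within a few dozen updates rather than being propped up by its long-run history.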
# Stacking Ensemble with Meta-Learner
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler
class StackingEnsemble:
"""Stacking ensemble with meta-learner."""
def __init__(self, base_models: List, meta_model=None):
self.base_models = base_models
self.meta_model = meta_model or LogisticRegression()
self.scaler = StandardScaler()
self.fitted = False
def fit(self, X: np.ndarray, y: np.ndarray):
"""Fit stacking ensemble."""
# Scale features
X_scaled = self.scaler.fit_transform(X)
# Generate base model predictions using cross-validation
meta_features = np.zeros((len(X), len(self.base_models)))
for i, model in enumerate(self.base_models):
# Get out-of-fold predictions
meta_features[:, i] = cross_val_predict(
model, X_scaled, y, cv=5, method='predict'
)
# Fit on full data
model.fit(X_scaled, y)
# Fit meta-learner
self.meta_model.fit(meta_features, y)
self.fitted = True
def predict(self, X: np.ndarray) -> np.ndarray:
"""Make stacked prediction."""
if not self.fitted:
raise ValueError("Model not fitted")
X_scaled = self.scaler.transform(X)
# Get base model predictions
meta_features = np.column_stack([
model.predict(X_scaled) for model in self.base_models
])
# Meta-learner prediction
return self.meta_model.predict(meta_features)
def predict_proba(self, X: np.ndarray) -> np.ndarray:
"""Get probability predictions."""
if not self.fitted:
raise ValueError("Model not fitted")
X_scaled = self.scaler.transform(X)
meta_features = np.column_stack([
model.predict(X_scaled) for model in self.base_models
])
return self.meta_model.predict_proba(meta_features)
print("StackingEnsemble defined")
# Test stacking ensemble
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Generate data
X_class, y_class = make_classification(n_samples=1000, n_features=20,
n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X_class, y_class, test_size=0.2, random_state=42
)
# Create base models
base_models = [
RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42),
GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
LogisticRegression(random_state=42)
]
# Train stacking ensemble
stacking = StackingEnsemble(base_models)
stacking.fit(X_train, y_train)
# Evaluate
stacking_pred = stacking.predict(X_test)
stacking_acc = accuracy_score(y_test, stacking_pred)
# Compare with individual models
print("Model Comparison:")
print(f"Stacking Ensemble: {stacking_acc:.4f}")
for i, model in enumerate(base_models):
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model.fit(X_train_scaled, y_train)
pred = model.predict(X_test_scaled)
acc = accuracy_score(y_test, pred)
print(f"Model {i} ({type(model).__name__}): {acc:.4f}")
Section 4: Meta-Learning for Finance
Meta-learning aims to learn how to learn, enabling quick adaptation to new market regimes.
# Model Selection Meta-Learner
class ModelSelector:
"""Meta-learner that selects best model based on market conditions."""
def __init__(self, models: Dict[str, Any]):
self.models = models
self.model_performance = {name: deque(maxlen=50) for name in models}
self.regime_detector = None
self.regime_model_map = {} # Maps regime to best model
def detect_regime(self, features: np.ndarray) -> str:
"""Detect current market regime."""
# Simple regime detection based on volatility and trend
volatility = np.std(features[-20:, 0]) if len(features) >= 20 else 0.01
trend = np.mean(features[-10:, 0]) if len(features) >= 10 else 0
if volatility > 0.02:
return 'high_volatility'
elif trend > 0.001:
return 'bullish'
elif trend < -0.001:
return 'bearish'
else:
return 'sideways'
def select_model(self, features: np.ndarray) -> str:
"""Select best model for current conditions."""
regime = self.detect_regime(features)
# Check if we have learned which model works best for this regime
if regime in self.regime_model_map:
return self.regime_model_map[regime]
# Otherwise, select model with best recent performance
best_model = None
best_accuracy = -1
for name, perf in self.model_performance.items():
if len(perf) > 0:
acc = np.mean(perf)
if acc > best_accuracy:
best_accuracy = acc
best_model = name
return best_model or list(self.models.keys())[0]
def predict(self, X: np.ndarray, features_history: np.ndarray) -> np.ndarray:
"""Make prediction using selected model."""
model_name = self.select_model(features_history)
return self.models[model_name].predict(X)
def update_performance(self, model_name: str, is_correct: bool,
features: np.ndarray):
"""Update model performance tracking."""
self.model_performance[model_name].append(1 if is_correct else 0)
# Update regime-model mapping
regime = self.detect_regime(features)
# Find best model for this regime
best_acc = -1
best_model = None
for name, perf in self.model_performance.items():
if len(perf) >= 10:
acc = np.mean(list(perf)[-10:])
if acc > best_acc:
best_acc = acc
best_model = name
if best_model:
self.regime_model_map[regime] = best_model
print("ModelSelector defined")
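When no regime-to-model mapping has been learned yet, `select_model` falls back to the model with the best recent windowed accuracy. That fallback rule in isolation (with made-up performance histories):

```python
import numpy as np
from collections import deque

# Pick the model with the best recent (windowed) accuracy.
performance = {
    'random_forest': deque([1, 1, 0, 1, 1], maxlen=50),   # accuracy 0.8
    'gradient_boost': deque([0, 1, 0, 0, 1], maxlen=50),  # accuracy 0.4
    'logistic': deque([1, 0, 1, 1, 0], maxlen=50),        # accuracy 0.6
}
best = max(performance, key=lambda name: np.mean(performance[name]))
print(best)  # random_forest
```

The bounded `deque` matters: it caps the history so a model's old wins or losses age out, mirroring the decayed weighting used by `DynamicEnsemble`.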
# Regime-Specific Model Training
class RegimeAdaptiveSystem:
"""System that adapts model based on detected regime."""
    def __init__(self):
        self.regime_models = {
            'high_volatility': RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42),
            'bullish': GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
            'bearish': GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
            'sideways': LogisticRegression(random_state=42)
        }
        # One scaler per regime: refitting a single shared scaler on each
        # regime's data would silently change the scaling seen by the others
        self.scalers = {regime: StandardScaler() for regime in self.regime_models}
        self.regime_data = {regime: {'X': [], 'y': []} for regime in self.regime_models}
        self.min_samples = 50

    def detect_regime(self, df: pd.DataFrame) -> str:
        """Detect market regime from price data."""
        if len(df) < 20:
            return 'sideways'
        returns = df['close'].pct_change().dropna()
        volatility = returns.std()
        trend = returns.mean()
        if volatility > 0.02:
            return 'high_volatility'
        elif trend > 0.001:
            return 'bullish'
        elif trend < -0.001:
            return 'bearish'
        else:
            return 'sideways'

    def add_sample(self, X: np.ndarray, y: int, regime: str):
        """Add training sample to regime-specific data."""
        self.regime_data[regime]['X'].append(X)
        self.regime_data[regime]['y'].append(y)
        # Retrain once enough samples accumulate (note: this retrains on
        # every subsequent sample; a real system would throttle this)
        if len(self.regime_data[regime]['X']) >= self.min_samples:
            self._retrain_regime(regime)

    def _retrain_regime(self, regime: str):
        """Retrain model for specific regime."""
        X = np.array(self.regime_data[regime]['X'])
        y = np.array(self.regime_data[regime]['y'])
        X_scaled = self.scalers[regime].fit_transform(X)
        self.regime_models[regime].fit(X_scaled, y)

    def predict(self, X: np.ndarray, df: pd.DataFrame) -> int:
        """Predict using regime-appropriate model."""
        regime = self.detect_regime(df)
        # Fall back to the default model if this regime lacks training data
        if len(self.regime_data[regime]['X']) < self.min_samples:
            regime = 'sideways'
        # If even the fallback model is untrained, return a neutral prediction
        # instead of calling predict on an unfitted model
        if len(self.regime_data[regime]['X']) < self.min_samples:
            return 0
        X_scaled = self.scalers[regime].transform(X.reshape(1, -1))
        return self.regime_models[regime].predict(X_scaled)[0]
print("RegimeAdaptiveSystem defined")
Section 5: Module Project - Advanced Trading System
Build an advanced trading system combining RL, online learning, and ensemble methods.
# Advanced ML Trading System
class AdvancedMLTradingSystem:
"""Trading system combining multiple advanced ML techniques."""
def __init__(self, initial_capital: float = 100000):
self.initial_capital = initial_capital
self.capital = initial_capital
self.position = 0
# Components
self.scaler = StandardScaler()
self.online_learner = None
self.ensemble = None
self.regime_system = RegimeAdaptiveSystem()
# Tracking
self.equity_curve = []
self.signals = []
self.regime_history = []
def create_features(self, df: pd.DataFrame) -> np.ndarray:
"""Create features from price data."""
data = df.copy()
# Returns
data['return_1d'] = data['close'].pct_change()
data['return_5d'] = data['close'].pct_change(5)
# Volatility
data['volatility'] = data['return_1d'].rolling(20).std()
# RSI
delta = data['close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
data['rsi'] = 100 - (100 / (1 + gain / (loss + 1e-10)))
# Price to SMA
data['price_to_sma'] = data['close'] / data['close'].rolling(20).mean()
feature_cols = ['return_1d', 'return_5d', 'volatility', 'rsi', 'price_to_sma']
return data[feature_cols].fillna(0).values
def initialize_models(self, train_data: pd.DataFrame):
"""Initialize all models with training data."""
features = self.create_features(train_data)
# Create target
target = (train_data['close'].shift(-1) > train_data['close']).astype(int).values
# Remove NaN
valid_idx = ~np.isnan(features).any(axis=1)
features = features[valid_idx]
target = target[valid_idx][:-1] # Remove last (no target)
features = features[:-1]
# Scale features
self.scaler.fit(features)
features_scaled = self.scaler.transform(features)
# Initialize online learner
self.online_learner = OnlineSGDClassifier(n_features=features.shape[1])
for i in range(min(100, len(features))):
self.online_learner.partial_fit(
features_scaled[i:i+1], target[i:i+1]
)
# Initialize ensemble
base_models = [
RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42),
GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42),
LogisticRegression(random_state=42)
]
for model in base_models:
model.fit(features_scaled, target)
self.ensemble = DynamicEnsemble(base_models)
print("Models initialized")
def generate_signal(self, current_data: pd.DataFrame) -> int:
"""Generate trading signal using ensemble of methods."""
features = self.create_features(current_data)
X = features[-1:]
if np.isnan(X).any():
return 0
X_scaled = self.scaler.transform(X)
# Get predictions from different methods
online_pred = self.online_learner.predict(X_scaled)[0]
ensemble_pred = self.ensemble.predict(X_scaled)[0]
regime_pred = self.regime_system.predict(X_scaled[0], current_data)
        # Combine predictions by majority vote: go long if at least 2 of 3 agree
        votes = [online_pred, ensemble_pred, regime_pred]
        return 1 if sum(votes) >= 2 else -1
def update_models(self, X: np.ndarray, y: int, df: pd.DataFrame):
"""Update models with new observation."""
X_scaled = self.scaler.transform(X.reshape(1, -1))
# Update online learner
self.online_learner.partial_fit(X_scaled, np.array([y]))
# Update ensemble weights
self.ensemble.update_weights(X_scaled[0], y)
# Update regime system
regime = self.regime_system.detect_regime(df)
self.regime_system.add_sample(X_scaled[0], y, regime)
    def trade(self, signal: int, price: float):
        """Execute trade based on signal (simplified cost-only model)."""
        if signal != self.position:
            # Commission proportional to the size of the position change
            # (flipping long to short crosses two units of exposure)
            cost = abs(signal - self.position) * self.capital * 0.001
            self.capital -= cost
            self.position = signal
def update_pnl(self, price_return: float):
"""Update P&L based on position."""
pnl = self.capital * self.position * price_return
self.capital += pnl
def run_backtest(self, data: pd.DataFrame, lookback: int = 100):
"""Run backtest on historical data."""
# Initialize with first portion
self.initialize_models(data.iloc[:lookback])
self.equity_curve = [self.initial_capital]
for i in range(lookback, len(data) - 1):
current_data = data.iloc[i-lookback:i+1]
# Generate signal
signal = self.generate_signal(current_data)
self.signals.append(signal)
# Record regime
regime = self.regime_system.detect_regime(current_data)
self.regime_history.append(regime)
# Trade
current_price = data.iloc[i]['close']
self.trade(signal, current_price)
# Update P&L
next_price = data.iloc[i+1]['close']
price_return = (next_price - current_price) / current_price
self.update_pnl(price_return)
self.equity_curve.append(self.capital)
# Update models with actual outcome
actual = 1 if next_price > current_price else 0
features = self.create_features(current_data)
if not np.isnan(features[-1]).any():
self.update_models(features[-1], actual, current_data)
return self.get_results()
def get_results(self) -> Dict:
"""Get backtest results."""
capitals = np.array(self.equity_curve)
returns = np.diff(capitals) / capitals[:-1]
return {
'total_return': (self.capital / self.initial_capital) - 1,
'sharpe_ratio': np.sqrt(252) * np.mean(returns) / (np.std(returns) + 1e-8),
'max_drawdown': (capitals / np.maximum.accumulate(capitals) - 1).min(),
'equity_curve': self.equity_curve,
'signals': self.signals,
'regime_history': self.regime_history
}
print("AdvancedMLTradingSystem defined")
# Run advanced system backtest
# Generate longer dataset
full_data = generate_trading_data(2000)
# Initialize and run system
advanced_system = AdvancedMLTradingSystem(initial_capital=100000)
results = advanced_system.run_backtest(full_data, lookback=100)
print("\n" + "="*50)
print("Advanced ML Trading System Results")
print("="*50)
print(f"Total Return: {results['total_return']:.2%}")
print(f"Sharpe Ratio: {results['sharpe_ratio']:.2f}")
print(f"Max Drawdown: {results['max_drawdown']:.2%}")
# Visualize advanced system results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Equity curve
axes[0, 0].plot(results['equity_curve'], label='Strategy')
buy_hold = 100000 * (full_data['close'] / full_data['close'].iloc[0])
axes[0, 0].plot(buy_hold.iloc[100:].values, alpha=0.7, label='Buy & Hold')
axes[0, 0].set_xlabel('Step')
axes[0, 0].set_ylabel('Capital ($)')
axes[0, 0].set_title('Advanced System vs Buy & Hold')
axes[0, 0].legend()
# Drawdown
capitals = np.array(results['equity_curve'])
peak = np.maximum.accumulate(capitals)
drawdown = (capitals - peak) / peak
axes[0, 1].fill_between(range(len(drawdown)), drawdown, 0, alpha=0.7, color='red')
axes[0, 1].set_xlabel('Step')
axes[0, 1].set_ylabel('Drawdown')
axes[0, 1].set_title('Drawdown')
# Regime distribution
from collections import Counter
regime_counts = Counter(results['regime_history'])
axes[1, 0].bar(regime_counts.keys(), regime_counts.values())
axes[1, 0].set_xlabel('Regime')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Regime Distribution')
axes[1, 0].tick_params(axis='x', rotation=45)
# Signal distribution
signal_counts = Counter(results['signals'])
axes[1, 1].bar(['Short (-1)', 'Long (1)'],
[signal_counts.get(-1, 0), signal_counts.get(1, 0)])
axes[1, 1].set_xlabel('Signal')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Signal Distribution')
plt.tight_layout()
plt.show()
Exercises
Complete the following exercises to practice advanced ML techniques.
Exercise 14.1: Implement Reward Shaping (Guided)
Create a custom reward function for RL trading.
Solution 14.1
def calculate_shaped_reward(position: int, price_return: float,
volatility: float, drawdown: float) -> float:
# Calculate base return reward
return_reward = position * price_return
# Calculate risk-adjusted reward
risk_adjusted = return_reward / (volatility + 1e-8)
# Calculate drawdown penalty
drawdown_penalty = -0.5 * abs(drawdown) if drawdown < -0.05 else 0
# Combine rewards
total_reward = risk_adjusted + drawdown_penalty
return total_reward
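A quick, self-contained check of the shaping behavior (the helper below restates the Solution 14.1 logic so it runs on its own; all input values are hypothetical): identical position and return, but only the second call crosses the -5% drawdown threshold and is penalized.

```python
# Restates the Solution 14.1 logic for a standalone check; the inputs
# below are hypothetical values, not course data.
def shaped_reward(position: int, price_return: float,
                  volatility: float, drawdown: float) -> float:
    risk_adjusted = position * price_return / (volatility + 1e-8)
    penalty = -0.5 * abs(drawdown) if drawdown < -0.05 else 0.0
    return risk_adjusted + penalty

print(f"small drawdown: {shaped_reward(1, 0.01, 0.02, -0.02):.4f}")  # no penalty
print(f"large drawdown: {shaped_reward(1, 0.01, 0.02, -0.10):.4f}")  # -0.05 penalty applied
```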
Exercise 14.2: Implement Online Weight Update (Guided)
Create a function to update ensemble weights online.
Solution 14.2
def update_ensemble_weights(weights: np.ndarray, predictions: np.ndarray,
actual: int, learning_rate: float = 0.1) -> np.ndarray:
# Calculate which models were correct
correct = (predictions == actual).astype(float)
# Calculate reward/penalty for each model
rewards = np.where(correct == 1, learning_rate, -learning_rate)
# Update weights using exponential update
new_weights = weights * np.exp(rewards)
# Normalize weights to sum to 1
new_weights = new_weights / new_weights.sum()
return new_weights
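To see the update rule in action, the loop below applies the same exponential update repeatedly against three simulated models (the hit rates are hypothetical); weight should concentrate on the most accurate model over many rounds.

```python
import numpy as np

# Repeated exponential-weights updates with three simulated models whose
# hit rates (0.45, 0.50, 0.65) are hypothetical; weight should drift
# toward the third model.
rng = np.random.default_rng(0)
weights = np.ones(3) / 3
hit_rates = np.array([0.45, 0.50, 0.65])
for _ in range(300):
    correct = (rng.random(3) < hit_rates).astype(float)  # 1.0 = correct call
    rewards = np.where(correct == 1, 0.1, -0.1)
    weights = weights * np.exp(rewards)
    weights = weights / weights.sum()  # renormalize to sum to 1
print("final weights:", np.round(weights, 3))
```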
Exercise 14.3: Implement Regime Detection (Guided)
Create a market regime detector.
Solution 14.3
def detect_market_regime(prices: np.ndarray, window: int = 20) -> str:
if len(prices) < window:
return 'unknown'
# Calculate returns
returns = np.diff(prices) / prices[:-1]
recent_returns = returns[-window:]
# Calculate trend (average return)
trend = np.mean(recent_returns)
# Calculate volatility
volatility = np.std(recent_returns)
# Classify regime
if volatility > 0.02:
return 'high_volatility'
elif trend > 0.001:
return 'trending_up'
elif trend < -0.001:
return 'trending_down'
else:
return 'low_volatility'
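A self-contained check on two synthetic price paths, one steadily trending and one very noisy (the helper restates the thresholds from Solution 14.3):

```python
import numpy as np

# Restates the Solution 14.3 rules for a standalone check: volatility > 2%
# wins, otherwise |mean return| > 0.1% decides the trend. Paths are synthetic.
def regime(prices: np.ndarray, window: int = 20) -> str:
    returns = np.diff(prices) / prices[:-1]
    recent = returns[-window:]
    trend, vol = np.mean(recent), np.std(recent)
    if vol > 0.02:
        return 'high_volatility'
    elif trend > 0.001:
        return 'trending_up'
    elif trend < -0.001:
        return 'trending_down'
    return 'low_volatility'

steady_up = 100 * np.cumprod(1 + np.full(60, 0.005))    # +0.5% every day
rng = np.random.default_rng(1)
choppy = 100 * np.cumprod(1 + rng.normal(0, 0.04, 60))  # ~4% daily noise
print(regime(steady_up), regime(choppy))
```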
Exercise 14.4: Build Experience Replay Buffer (Open-ended)
Create an experience replay buffer for deep RL.
Solution 14.4
class ReplayBuffer:
def __init__(self, capacity: int = 10000):
self.capacity = capacity
self.buffer = deque(maxlen=capacity)
self.priorities = deque(maxlen=capacity)
def push(self, state, action, reward, next_state, done, priority=1.0):
"""Add experience to buffer."""
experience = (state, action, reward, next_state, done)
self.buffer.append(experience)
self.priorities.append(priority)
def sample(self, batch_size: int) -> List[Tuple]:
"""Sample random batch."""
indices = np.random.choice(len(self.buffer),
size=min(batch_size, len(self.buffer)),
replace=False)
return [self.buffer[i] for i in indices]
def sample_prioritized(self, batch_size: int, alpha: float = 0.6) -> List[Tuple]:
"""Sample with priority weighting."""
priorities = np.array(self.priorities) ** alpha
probs = priorities / priorities.sum()
indices = np.random.choice(len(self.buffer),
size=min(batch_size, len(self.buffer)),
p=probs,
replace=False)
return [self.buffer[i] for i in indices]
def update_priority(self, index: int, priority: float):
"""Update priority for an experience."""
if index < len(self.priorities):
self.priorities[index] = priority
def __len__(self):
return len(self.buffer)
# Usage
buffer = ReplayBuffer(capacity=10000)
buffer.push(np.zeros(5), 1, 0.1, np.zeros(5), False)
print(f"Buffer size: {len(buffer)}")
Exercise 14.5: Implement Bandit-Based Model Selection (Open-ended)
Create a multi-armed bandit for dynamic model selection.
Solution 14.5
class ModelBandit:
def __init__(self, n_models: int, exploration_param: float = 2.0):
self.n_models = n_models
self.exploration_param = exploration_param
# Track performance
self.n_selections = np.zeros(n_models)
self.total_rewards = np.zeros(n_models)
self.total_rounds = 0
def select_model(self) -> int:
"""Select model using UCB algorithm."""
self.total_rounds += 1
# Ensure each model is tried at least once
for i in range(self.n_models):
if self.n_selections[i] == 0:
return i
# Calculate UCB scores
avg_rewards = self.total_rewards / self.n_selections
exploration_bonus = np.sqrt(
self.exploration_param * np.log(self.total_rounds) / self.n_selections
)
ucb_scores = avg_rewards + exploration_bonus
return np.argmax(ucb_scores)
def update(self, model_idx: int, reward: float):
"""Update model statistics."""
self.n_selections[model_idx] += 1
self.total_rewards[model_idx] += reward
def get_best_model(self) -> int:
"""Get best performing model."""
avg_rewards = self.total_rewards / (self.n_selections + 1e-8)
return np.argmax(avg_rewards)
def get_model_stats(self) -> pd.DataFrame:
"""Get statistics for all models."""
return pd.DataFrame({
'model': range(self.n_models),
'n_selections': self.n_selections,
'total_rewards': self.total_rewards,
'avg_reward': self.total_rewards / (self.n_selections + 1e-8)
})
# Usage
bandit = ModelBandit(n_models=3)
for _ in range(100):
model_idx = bandit.select_model()
reward = np.random.random() # Simulated reward
bandit.update(model_idx, reward)
print(bandit.get_model_stats())
Exercise 14.6: Create Adaptive Learning Rate Scheduler (Open-ended)
Build a learning rate scheduler that adapts to market conditions.
Solution 14.6
class MarketAdaptiveLRScheduler:
def __init__(self, initial_lr: float = 0.01,
min_lr: float = 0.0001,
max_lr: float = 0.1,
warmup_steps: int = 100):
self.initial_lr = initial_lr
self.min_lr = min_lr
self.max_lr = max_lr
self.warmup_steps = warmup_steps
self.current_lr = initial_lr
self.step_count = 0
self.volatility_history = deque(maxlen=50)
self.baseline_volatility = None
def step(self, volatility: float) -> float:
"""Update and return learning rate."""
self.step_count += 1
self.volatility_history.append(volatility)
# Warmup phase
if self.step_count <= self.warmup_steps:
warmup_factor = self.step_count / self.warmup_steps
self.current_lr = self.initial_lr * warmup_factor
return self.current_lr
# Set baseline after warmup
if self.baseline_volatility is None:
self.baseline_volatility = np.mean(self.volatility_history)
# Adapt based on volatility
current_vol = np.mean(list(self.volatility_history)[-10:])
vol_ratio = current_vol / (self.baseline_volatility + 1e-8)
# Higher volatility -> higher learning rate (faster adaptation)
if vol_ratio > 1.5: # High volatility
self.current_lr = min(self.current_lr * 1.1, self.max_lr)
elif vol_ratio < 0.7: # Low volatility
self.current_lr = max(self.current_lr * 0.95, self.min_lr)
return self.current_lr
def get_lr(self) -> float:
return self.current_lr
def reset(self):
self.current_lr = self.initial_lr
self.step_count = 0
self.volatility_history.clear()
self.baseline_volatility = None
# Usage
scheduler = MarketAdaptiveLRScheduler()
for i in range(200):
vol = 0.01 if i < 100 else 0.03 # Volatility spike at step 100
lr = scheduler.step(vol)
if i % 50 == 0:
print(f"Step {i}: LR = {lr:.6f}")
Summary
In this module, you learned:
- Reinforcement Learning: Using RL to learn trading policies through interaction with market environments
- Online Learning: Continuously adapting models to new data with drift detection
- Advanced Ensembles: Dynamic weighting and stacking for improved predictions
- Meta-Learning: Learning which models work best in different market conditions
- Production Systems: Combining multiple techniques into robust trading systems
Key Takeaways
- RL can discover trading strategies that supervised learning might miss
- Online learning enables continuous adaptation without full retraining
- Dynamic ensembles outperform static ensembles in changing markets
- Meta-learning helps select the right model for current conditions
- Combining multiple techniques provides the most robust results
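The online-learning takeaway leans on drift detection. As a closing illustration, a minimal rolling-accuracy drift check (the baseline, window size, and threshold are illustrative choices, not values from the course code) could look like:

```python
import numpy as np
from collections import deque

# Minimal drift check: flag drift when rolling accuracy falls more than
# `threshold` below a known baseline. All parameter values are illustrative.
class AccuracyDriftDetector:
    def __init__(self, baseline_accuracy: float = 0.95,
                 window: int = 50, threshold: float = 0.15):
        self.baseline = baseline_accuracy
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def update(self, correct: bool) -> bool:
        """Record one prediction outcome; return True if drift is flagged."""
        self.window.append(1.0 if correct else 0.0)
        if len(self.window) < self.window.maxlen:
            return False  # not enough history yet
        return self.baseline - np.mean(self.window) > self.threshold

# Synthetic outcome stream: perfect for 200 steps, then always wrong
detector = AccuracyDriftDetector()
flags = [detector.update(t < 200) for t in range(300)]
print("first drift flag at step:", flags.index(True))  # step 210
```

Because the window needs time to fill with post-change outcomes, the flag fires a few steps after the actual change point, a lag that shrinks with smaller windows at the cost of more false alarms.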
Course Completion
Congratulations! You have completed the Machine Learning for Financial Markets course. You now have a comprehensive understanding of applying ML techniques to trading and investment problems.
Capstone Project: End-to-End ML Trading System
Project Overview
In this capstone project, you will build a complete machine learning trading system from scratch. This project integrates all concepts from the course including data preprocessing, feature engineering, model selection, backtesting, and production deployment.
Learning Objectives
By completing this project, you will demonstrate:
- Comprehensive feature engineering for financial data
- Multiple ML model implementation and comparison
- Proper walk-forward backtesting methodology
- Production-ready system architecture
- Performance analysis and risk management
Project Requirements
Build a trading system that:
1. Processes raw market data into ML-ready features
2. Trains and evaluates multiple model types
3. Implements proper walk-forward validation
4. Includes realistic transaction costs
5. Provides comprehensive performance analysis
6. Is designed for production deployment
Estimated Time: 6-8 hours
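Requirement 4 is worth sketching before diving in: with proportional commission, cost accrues on position *changes* (turnover), not on every bar, so a flip from short to long costs twice as much as entering from flat. A minimal illustration (the 0.1% rate and the position/return series are just examples):

```python
import numpy as np

# Proportional commission applied to turnover: cost per bar is
# |change in position| * commission rate. Values are illustrative.
commission = 0.001
positions = np.array([0, 1, 1, -1, -1, 0])                  # target position each bar
returns = np.array([0.00, 0.01, -0.005, 0.02, 0.01, 0.0])   # per-bar asset returns
turnover = np.abs(np.diff(np.concatenate([[0], positions])))
costs = turnover * commission
net_returns = positions * returns - costs
print("turnover per bar:", turnover)          # note the 2 on the short flip
print("gross returns:", np.round(positions * returns, 4))
print("net returns:  ", np.round(net_returns, 4))
```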
Part 1: Setup and Data Generation
Set up the project environment and generate realistic market data.
# Import all required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Any, Optional
from dataclasses import dataclass, field
from collections import deque
from abc import ABC, abstractmethod
import warnings
warnings.filterwarnings('ignore')
# ML libraries
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
np.random.seed(42)
print("All libraries loaded successfully")
print(f"Project started: {datetime.now()}")
# Generate comprehensive market data
def generate_market_data(n_days=2500, n_assets=3):
"""
Generate realistic multi-asset market data with:
- Regime switching
- Volatility clustering
- Cross-asset correlations
"""
np.random.seed(42)
dates = pd.date_range(start='2015-01-01', periods=n_days, freq='D')
# Create regime series
regime = np.zeros(n_days)
current_regime = 0
for i in range(n_days):
if np.random.random() < 0.005: # 0.5% chance to switch
current_regime = 1 - current_regime
regime[i] = current_regime
assets = {}
asset_names = ['STOCK_A', 'STOCK_B', 'STOCK_C'][:n_assets]
# Correlation matrix
correlation = np.array([
[1.0, 0.6, 0.3],
[0.6, 1.0, 0.4],
[0.3, 0.4, 1.0]
])[:n_assets, :n_assets]
# Generate correlated returns
L = np.linalg.cholesky(correlation)
uncorr_returns = np.random.normal(0, 1, (n_days, n_assets))
corr_returns = uncorr_returns @ L.T
for idx, asset in enumerate(asset_names):
# Base parameters vary by regime
base_return = np.where(regime == 0, 0.0004, -0.0001)
base_vol = np.where(regime == 0, 0.012, 0.020)
# Apply volatility clustering
volatility = np.zeros(n_days)
volatility[0] = base_vol[0]
for i in range(1, n_days):
volatility[i] = 0.9 * volatility[i-1] + 0.1 * base_vol[i]
# Generate returns
returns = base_return + volatility * corr_returns[:, idx]
# Generate prices
prices = 100 * np.exp(np.cumsum(returns))
# Create OHLCV
daily_range = volatility * np.random.uniform(0.5, 1.5, n_days)
assets[asset] = pd.DataFrame({
'date': dates,
'open': np.roll(prices, 1),
'high': prices * (1 + daily_range),
'low': prices * (1 - daily_range),
'close': prices,
'volume': np.random.lognormal(15, 0.5, n_days) * (1 + regime * 0.5),
'regime': regime
})
assets[asset].loc[0, 'open'] = assets[asset].loc[0, 'close']
assets[asset].set_index('date', inplace=True)
return assets
# Generate data
market_data = generate_market_data(n_days=2500, n_assets=3)
print(f"Generated data for {len(market_data)} assets")
for asset, df in market_data.items():
print(f" {asset}: {len(df)} days from {df.index[0].date()} to {df.index[-1].date()}")
# Visualize market data
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Price series
for asset, df in market_data.items():
axes[0, 0].plot(df.index, df['close'], label=asset)
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Price')
axes[0, 0].set_title('Asset Prices')
axes[0, 0].legend()
# Returns distribution
for asset, df in market_data.items():
returns = df['close'].pct_change().dropna()
axes[0, 1].hist(returns, bins=50, alpha=0.5, label=asset)
axes[0, 1].set_xlabel('Daily Return')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Return Distributions')
axes[0, 1].legend()
# Regime over time (first asset)
first_asset = list(market_data.keys())[0]
axes[1, 0].fill_between(market_data[first_asset].index,
market_data[first_asset]['regime'],
alpha=0.5)
axes[1, 0].set_xlabel('Date')
axes[1, 0].set_ylabel('Regime')
axes[1, 0].set_title('Market Regime (0=Bull, 1=Bear)')
# Rolling volatility
for asset, df in market_data.items():
returns = df['close'].pct_change()
rolling_vol = returns.rolling(20).std() * np.sqrt(252)
axes[1, 1].plot(df.index, rolling_vol, label=asset)
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Annualized Volatility')
axes[1, 1].set_title('Rolling 20-Day Volatility')
axes[1, 1].legend()
plt.tight_layout()
plt.show()
Part 2: Feature Engineering Pipeline
Build a comprehensive feature engineering pipeline.
# Feature Engineering Class
class FeatureEngineer:
"""Comprehensive feature engineering for trading."""
def __init__(self):
self.feature_names = []
def create_price_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create price-based features."""
data = df.copy()
# Returns at multiple horizons
for period in [1, 2, 3, 5, 10, 20]:
data[f'return_{period}d'] = data['close'].pct_change(period)
# Log returns
data['log_return_1d'] = np.log(data['close'] / data['close'].shift(1))
# Price momentum
for period in [5, 10, 20, 50]:
data[f'momentum_{period}d'] = data['close'] / data['close'].shift(period) - 1
return data
def create_volatility_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create volatility-based features."""
data = df.copy()
returns = data['close'].pct_change()
# Rolling volatility
for period in [5, 10, 20, 50]:
data[f'volatility_{period}d'] = returns.rolling(period).std()
# Volatility ratio
data['volatility_ratio'] = data['volatility_5d'] / (data['volatility_20d'] + 1e-10)
# Parkinson volatility (using high/low)
data['parkinson_vol'] = np.sqrt(
(1 / (4 * np.log(2))) *
(np.log(data['high'] / data['low']) ** 2).rolling(20).mean()
)
# Average True Range
high_low = data['high'] - data['low']
high_close = abs(data['high'] - data['close'].shift(1))
low_close = abs(data['low'] - data['close'].shift(1))
tr = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
data['atr_14'] = tr.rolling(14).mean()
data['atr_normalized'] = data['atr_14'] / data['close']
return data
def create_technical_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create technical indicator features."""
data = df.copy()
# Moving averages
for period in [5, 10, 20, 50, 200]:
data[f'sma_{period}'] = data['close'].rolling(period).mean()
data[f'ema_{period}'] = data['close'].ewm(span=period).mean()
data[f'price_to_sma_{period}'] = data['close'] / data[f'sma_{period}']
# MACD
exp12 = data['close'].ewm(span=12).mean()
exp26 = data['close'].ewm(span=26).mean()
data['macd'] = exp12 - exp26
data['macd_signal'] = data['macd'].ewm(span=9).mean()
data['macd_hist'] = data['macd'] - data['macd_signal']
# RSI
delta = data['close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
rs = gain / (loss + 1e-10)
data['rsi_14'] = 100 - (100 / (1 + rs))
# Stochastic Oscillator
low_14 = data['low'].rolling(14).min()
high_14 = data['high'].rolling(14).max()
data['stoch_k'] = 100 * (data['close'] - low_14) / (high_14 - low_14 + 1e-10)
data['stoch_d'] = data['stoch_k'].rolling(3).mean()
# Bollinger Bands
bb_sma = data['close'].rolling(20).mean()
bb_std = data['close'].rolling(20).std()
data['bb_upper'] = bb_sma + 2 * bb_std
data['bb_lower'] = bb_sma - 2 * bb_std
data['bb_width'] = (data['bb_upper'] - data['bb_lower']) / bb_sma
data['bb_position'] = (data['close'] - data['bb_lower']) / (data['bb_upper'] - data['bb_lower'] + 1e-10)
return data
def create_volume_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create volume-based features."""
data = df.copy()
# Volume moving averages
for period in [5, 10, 20]:
data[f'volume_sma_{period}'] = data['volume'].rolling(period).mean()
# Volume ratio
data['volume_ratio'] = data['volume'] / data['volume_sma_20']
# On-Balance Volume (OBV)
obv = np.where(data['close'] > data['close'].shift(1), data['volume'],
np.where(data['close'] < data['close'].shift(1), -data['volume'], 0))
data['obv'] = np.cumsum(obv)
data['obv_sma'] = data['obv'].rolling(20).mean()
# Volume-Price Trend
data['vpt'] = (data['volume'] * data['close'].pct_change()).cumsum()
return data
def create_all_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""Create all features."""
data = df.copy()
data = self.create_price_features(data)
data = self.create_volatility_features(data)
data = self.create_technical_features(data)
data = self.create_volume_features(data)
# Create target (next day direction)
data['target'] = (data['close'].shift(-1) > data['close']).astype(int)
data['target_return'] = data['close'].pct_change().shift(-1)
# Store feature names
exclude_cols = ['open', 'high', 'low', 'close', 'volume', 'regime',
'target', 'target_return'] + \
[c for c in data.columns if 'sma_' in c and 'price_to' not in c] + \
[c for c in data.columns if 'ema_' in c] + \
['bb_upper', 'bb_lower', 'volume_sma_5', 'volume_sma_10',
'volume_sma_20', 'obv', 'obv_sma', 'vpt']
self.feature_names = [c for c in data.columns if c not in exclude_cols]
return data
def get_feature_names(self) -> List[str]:
"""Get list of feature names."""
return self.feature_names
# Create features for primary asset
feature_engineer = FeatureEngineer()
primary_asset = 'STOCK_A'
df_features = feature_engineer.create_all_features(market_data[primary_asset])
print(f"\nCreated {len(feature_engineer.get_feature_names())} features:")
for i, name in enumerate(feature_engineer.get_feature_names()):
print(f" {i+1}. {name}")
# TODO: Complete the feature correlation analysis
# Analyze feature correlations and select the most important features
def analyze_feature_importance(df: pd.DataFrame, feature_names: List[str],
target_col: str = 'target') -> pd.DataFrame:
"""
Analyze feature importance and correlation with target.
Returns DataFrame with:
- Feature correlations with target
- Feature correlations with each other (to detect multicollinearity)
"""
# YOUR CODE HERE
# 1. Calculate correlation of each feature with target
# 2. Identify highly correlated feature pairs
# 3. Return sorted importance scores
valid_data = df[feature_names + [target_col]].dropna()
# Calculate correlations with target
target_corr = valid_data[feature_names].corrwith(valid_data[target_col]).abs()
# Create importance DataFrame
importance = pd.DataFrame({
'feature': feature_names,
'target_correlation': target_corr.values
}).sort_values('target_correlation', ascending=False)
return importance
# Analyze features
feature_importance = analyze_feature_importance(
df_features, feature_engineer.get_feature_names()
)
print("\nTop 15 Features by Target Correlation:")
print(feature_importance.head(15).to_string(index=False))
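The TODO above also asks for highly correlated feature pairs, which the solution leaves out. One way to finish that part, scanning only the upper triangle of the correlation matrix so each pair appears once (the 0.9 threshold is an arbitrary choice), shown here on a tiny synthetic example:

```python
import numpy as np
import pandas as pd

# Report feature pairs whose absolute pairwise correlation exceeds a
# threshold, using the upper triangle to skip self- and duplicate pairs.
def find_correlated_pairs(df: pd.DataFrame, feature_names, threshold=0.9):
    corr = df[feature_names].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = (upper.stack()
                  .loc[lambda s: s > threshold]
                  .sort_values(ascending=False))
    return pairs.reset_index().set_axis(['feature_a', 'feature_b', 'corr'], axis=1)

# Synthetic demonstration: x and x_copy are nearly collinear, z is independent
rng = np.random.default_rng(0)
demo = pd.DataFrame({'x': rng.normal(size=200)})
demo['x_copy'] = demo['x'] * 2 + rng.normal(scale=0.01, size=200)
demo['z'] = rng.normal(size=200)
print(find_correlated_pairs(demo, ['x', 'x_copy', 'z']))
```

On the real feature set, pairs flagged here (e.g. overlapping-window volatilities) are candidates for dropping one member before training, since tree importances and linear coefficients are both unstable under multicollinearity.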
Part 3: Model Training and Selection
Train multiple models and select the best one.
# Model Training Framework
class ModelTrainer:
"""Framework for training and comparing models."""
def __init__(self, feature_names: List[str]):
self.feature_names = feature_names
self.scaler = StandardScaler()
self.models = {}
self.results = {}
def prepare_data(self, df: pd.DataFrame, target_col: str = 'target'):
"""Prepare data for training."""
# Get features and target
valid_mask = ~df[self.feature_names + [target_col]].isna().any(axis=1)
valid_data = df[valid_mask].copy()
X = valid_data[self.feature_names].values
y = valid_data[target_col].values
dates = valid_data.index
return X, y, dates
def train_test_split(self, X: np.ndarray, y: np.ndarray,
dates: pd.DatetimeIndex, train_ratio: float = 0.7):
"""Time-based train/test split."""
split_idx = int(len(X) * train_ratio)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
dates_train, dates_test = dates[:split_idx], dates[split_idx:]
# Scale features
X_train_scaled = self.scaler.fit_transform(X_train)
X_test_scaled = self.scaler.transform(X_test)
return (X_train_scaled, X_test_scaled, y_train, y_test,
dates_train, dates_test)
def train_model(self, name: str, model, X_train: np.ndarray,
y_train: np.ndarray):
"""Train a model."""
model.fit(X_train, y_train)
self.models[name] = model
return model
def evaluate_model(self, name: str, X_test: np.ndarray,
y_test: np.ndarray) -> Dict:
"""Evaluate a trained model."""
model = self.models[name]
predictions = model.predict(X_test)
results = {
'accuracy': accuracy_score(y_test, predictions),
'precision': precision_score(y_test, predictions, zero_division=0),
'recall': recall_score(y_test, predictions, zero_division=0),
'f1': f1_score(y_test, predictions, zero_division=0),
'predictions': predictions
}
self.results[name] = results
return results
def compare_models(self) -> pd.DataFrame:
"""Compare all trained models."""
comparison = []
for name, results in self.results.items():
comparison.append({
'model': name,
'accuracy': results['accuracy'],
'precision': results['precision'],
'recall': results['recall'],
'f1': results['f1']
})
return pd.DataFrame(comparison).sort_values('f1', ascending=False)
print("ModelTrainer class defined")
# TODO: Train and compare multiple models
# Select top features
top_features = feature_importance.head(20)['feature'].tolist()
# Initialize trainer
trainer = ModelTrainer(top_features)
# Prepare data
X, y, dates = trainer.prepare_data(df_features)
# Split data
(X_train, X_test, y_train, y_test,
dates_train, dates_test) = trainer.train_test_split(X, y, dates)
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
# Define models to train
models_to_train = {
'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
}
# Train and evaluate models
print("\nTraining models...")
for name, model in models_to_train.items():
print(f" Training {name}...")
trainer.train_model(name, model, X_train, y_train)
results = trainer.evaluate_model(name, X_test, y_test)
print(f" Accuracy: {results['accuracy']:.4f}, F1: {results['f1']:.4f}")
# Compare models
print("\nModel Comparison:")
comparison = trainer.compare_models()
print(comparison.to_string(index=False))
Part 4: Walk-Forward Backtesting
Implement proper walk-forward validation.
# Walk-Forward Backtester
class WalkForwardBacktester:
"""Walk-forward backtesting with realistic assumptions."""
def __init__(self, model, feature_names: List[str],
train_window: int = 252, test_window: int = 21,
step_size: int = 21, commission: float = 0.001):
self.model = model
self.feature_names = feature_names
self.train_window = train_window
self.test_window = test_window
self.step_size = step_size
self.commission = commission
self.scaler = StandardScaler()
self.fold_results = []
self.all_predictions = None
def run(self, df: pd.DataFrame) -> pd.DataFrame:
"""Run walk-forward backtest."""
# Prepare data
valid_mask = ~df[self.feature_names + ['target']].isna().any(axis=1)
data = df[valid_mask].copy()
X = data[self.feature_names].values
y = data['target'].values
n_samples = len(X)
predictions = np.full(n_samples, np.nan)
probabilities = np.full(n_samples, np.nan)
start_idx = self.train_window
fold = 0
while start_idx + self.test_window <= n_samples:
# Define windows
train_start = max(0, start_idx - self.train_window)
train_end = start_idx
test_start = start_idx
test_end = min(start_idx + self.test_window, n_samples)
# Get data
X_train = X[train_start:train_end]
y_train = y[train_start:train_end]
X_test = X[test_start:test_end]
y_test = y[test_start:test_end]
# Scale
X_train_scaled = self.scaler.fit_transform(X_train)
X_test_scaled = self.scaler.transform(X_test)
# Train
self.model.fit(X_train_scaled, y_train)
# Predict
pred = self.model.predict(X_test_scaled)
prob = self.model.predict_proba(X_test_scaled)[:, 1]
predictions[test_start:test_end] = pred
probabilities[test_start:test_end] = prob
# Record fold results
self.fold_results.append({
'fold': fold,
'train_start': data.index[train_start],
'test_start': data.index[test_start],
'test_end': data.index[test_end-1],
'accuracy': accuracy_score(y_test, pred)
})
fold += 1
start_idx += self.step_size
# Store results
results = data.copy()
results['prediction'] = predictions
results['probability'] = probabilities
# Keep NaN where no out-of-sample prediction exists (warm-up window),
# so the initial train_window rows are excluded rather than treated as shorts
results['signal'] = np.where(np.isnan(predictions), np.nan, np.where(predictions == 1, 1, -1))
self.all_predictions = results
return results
def calculate_backtest_metrics(self, initial_capital: float = 100000) -> Dict:
"""Calculate backtest performance metrics."""
if self.all_predictions is None:
raise ValueError("Run backtest first")
results = self.all_predictions.dropna(subset=['signal']).copy()
# Calculate strategy returns
position = results['signal'].values
returns = results['target_return'].values
# Account for position changes (transaction costs)
position_changes = np.abs(np.diff(np.concatenate([[0], position])))
costs = position_changes * self.commission
# Strategy returns
strategy_returns = position * returns - costs
# Calculate metrics
cumulative_returns = (1 + strategy_returns).cumprod()
total_return = cumulative_returns[-1] - 1 if len(cumulative_returns) > 0 else 0
sharpe = np.sqrt(252) * np.mean(strategy_returns) / (np.std(strategy_returns) + 1e-8)
# Max drawdown
peak = np.maximum.accumulate(cumulative_returns)
drawdown = (cumulative_returns - peak) / peak
max_drawdown = np.min(drawdown)
# Win rate
win_rate = np.mean(strategy_returns > 0)
# Trade statistics
n_trades = np.sum(position_changes > 0)
total_costs = np.sum(costs) * initial_capital
return {
'total_return': total_return,
'sharpe_ratio': sharpe,
'max_drawdown': max_drawdown,
'win_rate': win_rate,
'n_trades': n_trades,
'total_costs': total_costs,
'avg_fold_accuracy': np.mean([f['accuracy'] for f in self.fold_results]),
'cumulative_returns': cumulative_returns,
'strategy_returns': strategy_returns
}
print("WalkForwardBacktester class defined")
# TODO: Run walk-forward backtest
# Select best model from comparison
best_model_name = comparison.iloc[0]['model']
print(f"Using best model: {best_model_name}")
# Create fresh model instance
if 'Random Forest' in best_model_name:
backtest_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
elif 'Gradient Boosting' in best_model_name:
backtest_model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
else:
backtest_model = LogisticRegression(random_state=42, max_iter=1000)
# Initialize backtester
backtester = WalkForwardBacktester(
model=backtest_model,
feature_names=top_features,
train_window=252,
test_window=21,
step_size=21,
commission=0.001
)
# Run backtest
print("\nRunning walk-forward backtest...")
backtest_results = backtester.run(df_features)
# Calculate metrics
metrics = backtester.calculate_backtest_metrics(initial_capital=100000)
print("\n" + "="*50)
print("WALK-FORWARD BACKTEST RESULTS")
print("="*50)
print(f"Total Return: {metrics['total_return']:.2%}")
print(f"Sharpe Ratio: {metrics['sharpe_ratio']:.2f}")
print(f"Max Drawdown: {metrics['max_drawdown']:.2%}")
print(f"Win Rate: {metrics['win_rate']:.2%}")
print(f"Number of Trades: {metrics['n_trades']:.0f}")
print(f"Total Costs: ${metrics['total_costs']:,.2f}")
print(f"Average Fold Accuracy: {metrics['avg_fold_accuracy']:.4f}")
# Visualize backtest results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Cumulative returns
valid_results = backtest_results.dropna(subset=['signal'])
strategy_cum = metrics['cumulative_returns']
buy_hold_cum = (1 + valid_results['target_return']).cumprod()
axes[0, 0].plot(range(len(strategy_cum)), strategy_cum, label='Strategy', linewidth=2)
axes[0, 0].plot(range(len(buy_hold_cum)), buy_hold_cum.values, label='Buy & Hold', alpha=0.7)
axes[0, 0].set_xlabel('Trading Days')
axes[0, 0].set_ylabel('Cumulative Return')
axes[0, 0].set_title('Strategy vs Buy & Hold')
axes[0, 0].legend()
# Drawdown
peak = np.maximum.accumulate(strategy_cum)
drawdown = (strategy_cum - peak) / peak
axes[0, 1].fill_between(range(len(drawdown)), drawdown, 0, alpha=0.7, color='red')
axes[0, 1].set_xlabel('Trading Days')
axes[0, 1].set_ylabel('Drawdown')
axes[0, 1].set_title('Strategy Drawdown')
# Fold accuracy over time
fold_df = pd.DataFrame(backtester.fold_results)
axes[1, 0].bar(range(len(fold_df)), fold_df['accuracy'])
axes[1, 0].axhline(y=0.5, color='red', linestyle='--', label='Random')
axes[1, 0].axhline(y=fold_df['accuracy'].mean(), color='green', linestyle='--', label='Average')
axes[1, 0].set_xlabel('Fold')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].set_title('Walk-Forward Accuracy by Fold')
axes[1, 0].legend()
# Monthly returns heatmap
strategy_returns = pd.Series(metrics['strategy_returns'], index=valid_results.index)
monthly = strategy_returns.resample('M').sum()
colors = ['green' if r > 0 else 'red' for r in monthly.values]
axes[1, 1].bar(range(len(monthly)), monthly.values, color=colors, alpha=0.7)
axes[1, 1].set_xlabel('Month')
axes[1, 1].set_ylabel('Monthly Return')
axes[1, 1].set_title('Monthly Returns')
plt.tight_layout()
plt.show()
Part 5: Production System Design
Design a production-ready system architecture.
# TODO: Build a complete production trading system
class ProductionTradingSystem:
"""
Complete production ML trading system.
This system should include:
- Feature pipeline with versioning
- Model registry with version control
- Real-time prediction service
- Performance monitoring
- Alert system
"""
def __init__(self, model, feature_names: List[str],
initial_capital: float = 100000):
# Your implementation here
self.model = model
self.feature_names = feature_names
self.initial_capital = initial_capital
self.capital = initial_capital
self.position = 0
self.scaler = StandardScaler()
self.feature_engineer = FeatureEngineer()
# Tracking
self.predictions_log = []
self.trades_log = []
self.equity_curve = [initial_capital]
self.alerts = []
# Performance monitoring
self.rolling_accuracy = deque(maxlen=50)
def fit(self, train_data: pd.DataFrame):
"""Fit the system on training data."""
# Create features
df_features = self.feature_engineer.create_all_features(train_data)
# Prepare data
valid_mask = ~df_features[self.feature_names + ['target']].isna().any(axis=1)
valid_data = df_features[valid_mask]
X = valid_data[self.feature_names].values
y = valid_data['target'].values
# Fit scaler and model
X_scaled = self.scaler.fit_transform(X)
self.model.fit(X_scaled, y)
print(f"System fitted on {len(X)} samples")
def predict(self, current_data: pd.DataFrame) -> Dict:
"""Generate prediction for current data."""
try:
# Create features
df_features = self.feature_engineer.create_all_features(current_data)
# Get latest features
X = df_features[self.feature_names].iloc[-1:].values
if np.isnan(X).any():
return {'status': 'error', 'message': 'NaN in features'}
# Scale and predict
X_scaled = self.scaler.transform(X)
prediction = self.model.predict(X_scaled)[0]
probability = self.model.predict_proba(X_scaled)[0, 1]
signal = 1 if prediction == 1 else -1
result = {
'status': 'success',
'prediction': int(prediction),
'probability': float(probability),
'signal': signal,
'timestamp': datetime.now()
}
# Log prediction
self.predictions_log.append(result)
return result
except Exception as e:
return {'status': 'error', 'message': str(e)}
def update_with_actual(self, actual: int):
"""Update system with actual outcome."""
if self.predictions_log:
last_pred = self.predictions_log[-1]['prediction']
is_correct = last_pred == actual
self.rolling_accuracy.append(1 if is_correct else 0)
# Check for performance degradation
if len(self.rolling_accuracy) >= 20:
acc = np.mean(self.rolling_accuracy)
if acc < 0.45:
self.alerts.append({
'timestamp': datetime.now(),
'type': 'performance_degradation',
'message': f'Rolling accuracy dropped to {acc:.2%}'
})
def trade(self, signal: int, price: float):
"""Execute trade based on signal."""
if signal != self.position:
# Calculate cost
cost = abs(signal - self.position) * self.capital * 0.001
self.capital -= cost
trade = {
'timestamp': datetime.now(),
'old_position': self.position,
'new_position': signal,
'price': price,
'cost': cost
}
self.trades_log.append(trade)
self.position = signal
def update_pnl(self, price_return: float):
"""Update P&L based on position."""
pnl = self.capital * self.position * price_return
self.capital += pnl
self.equity_curve.append(self.capital)
def get_status(self) -> Dict:
"""Get system status."""
return {
'capital': self.capital,
'position': self.position,
'total_return': (self.capital / self.initial_capital) - 1,
'n_predictions': len(self.predictions_log),
'n_trades': len(self.trades_log),
'rolling_accuracy': np.mean(self.rolling_accuracy) if self.rolling_accuracy else None,
'n_alerts': len(self.alerts)
}
print("ProductionTradingSystem class defined")
# Test production system
# Split data for production simulation
train_data = market_data[primary_asset].iloc[:1500]
test_data = market_data[primary_asset].iloc[1500:]
# Initialize production system
prod_system = ProductionTradingSystem(
model=RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
feature_names=top_features,
initial_capital=100000
)
# Fit on training data
prod_system.fit(train_data)
# Simulate live trading
print("\nSimulating live trading...")
lookback = 100
for i in range(lookback, len(test_data) - 1):
# Get current data window
current_data = test_data.iloc[i-lookback:i+1]
current_price = test_data.iloc[i]['close']
# Generate prediction
result = prod_system.predict(current_data)
    if result['status'] == 'success':
        # Apply P&L first, using the position held coming into this bar,
        # so the return already realized is not attributed to the new signal
        if i > lookback:
            prev_price = test_data.iloc[i-1]['close']
            price_return = (current_price - prev_price) / prev_price
            prod_system.update_pnl(price_return)
        # Then trade on the new signal at the current price
        prod_system.trade(result['signal'], current_price)
# Update with actual
next_price = test_data.iloc[i+1]['close']
actual = 1 if next_price > current_price else 0
prod_system.update_with_actual(actual)
# Get final status
status = prod_system.get_status()
print("\n" + "="*50)
print("PRODUCTION SYSTEM STATUS")
print("="*50)
for key, value in status.items():
if 'return' in key or 'accuracy' in key:
        print(f"{key}: {value:.2%}" if value is not None else f"{key}: N/A")
elif 'capital' in key:
print(f"{key}: ${value:,.2f}")
else:
print(f"{key}: {value}")
Part 6: Final Analysis and Report
Generate comprehensive analysis and final report.
# Generate final comprehensive report
def generate_final_report(backtester: WalkForwardBacktester,
prod_system: ProductionTradingSystem,
model_comparison: pd.DataFrame) -> str:
"""Generate comprehensive project report."""
bt_metrics = backtester.calculate_backtest_metrics()
prod_status = prod_system.get_status()
report = f"""
{'='*60}
CAPSTONE PROJECT: END-TO-END ML TRADING SYSTEM
Final Report
{'='*60}
Generated: {datetime.now()}
{'='*60}
1. MODEL SELECTION
{'='*60}
{model_comparison.to_string(index=False)}
Best Model: {model_comparison.iloc[0]['model']}
{'='*60}
2. WALK-FORWARD BACKTEST RESULTS
{'='*60}
Total Return: {bt_metrics['total_return']:.2%}
Sharpe Ratio: {bt_metrics['sharpe_ratio']:.2f}
Max Drawdown: {bt_metrics['max_drawdown']:.2%}
Win Rate: {bt_metrics['win_rate']:.2%}
Number of Trades: {bt_metrics['n_trades']:.0f}
Total Transaction Costs: ${bt_metrics['total_costs']:,.2f}
Average Fold Accuracy: {bt_metrics['avg_fold_accuracy']:.4f}
{'='*60}
3. PRODUCTION SIMULATION RESULTS
{'='*60}
Final Capital: ${prod_status['capital']:,.2f}
Total Return: {prod_status['total_return']:.2%}
Total Predictions: {prod_status['n_predictions']}
Total Trades: {prod_status['n_trades']}
Rolling Accuracy: {f"{prod_status['rolling_accuracy']:.2%}" if prod_status['rolling_accuracy'] is not None else "N/A"}
Alerts Generated: {prod_status['n_alerts']}
{'='*60}
4. KEY FINDINGS
{'='*60}
- The {model_comparison.iloc[0]['model']} model achieved the best F1 score
- Walk-forward validation shows realistic out-of-sample performance
- Transaction costs significantly impact overall returns
- System includes monitoring for performance degradation
{'='*60}
5. RECOMMENDATIONS
{'='*60}
- Consider additional features (sentiment, alternative data)
- Implement ensemble methods for more robust predictions
- Add regime detection for adaptive model selection
- Monitor for concept drift and retrain periodically
{'='*60}
END OF REPORT
{'='*60}
"""
return report
# Generate and print report
final_report = generate_final_report(backtester, prod_system, comparison)
print(final_report)
# Final visualization
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
# 1. Model comparison
x = range(len(comparison))
axes[0, 0].bar(x, comparison['f1'])
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(comparison['model'], rotation=45, ha='right')
axes[0, 0].set_ylabel('F1 Score')
axes[0, 0].set_title('Model Comparison')
# 2. Backtest equity curve
axes[0, 1].plot(metrics['cumulative_returns'])
axes[0, 1].set_xlabel('Trading Days')
axes[0, 1].set_ylabel('Cumulative Return')
axes[0, 1].set_title('Backtest Equity Curve')
# 3. Production equity curve
axes[0, 2].plot(prod_system.equity_curve)
axes[0, 2].set_xlabel('Trading Days')
axes[0, 2].set_ylabel('Capital ($)')
axes[0, 2].set_title('Production Simulation')
# 4. Fold accuracy distribution
fold_accuracies = [f['accuracy'] for f in backtester.fold_results]
axes[1, 0].hist(fold_accuracies, bins=20, edgecolor='black')
axes[1, 0].axvline(x=0.5, color='red', linestyle='--')
axes[1, 0].set_xlabel('Accuracy')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Walk-Forward Accuracy Distribution')
# 5. Expanding accuracy over recent predictions (the deque keeps only the last 50)
if prod_system.rolling_accuracy:
rolling_acc = list(prod_system.rolling_accuracy)
cumulative_acc = np.cumsum(rolling_acc) / np.arange(1, len(rolling_acc) + 1)
axes[1, 1].plot(cumulative_acc)
axes[1, 1].axhline(y=0.5, color='red', linestyle='--')
axes[1, 1].set_xlabel('Prediction Number')
axes[1, 1].set_ylabel('Cumulative Accuracy')
axes[1, 1].set_title('Production Accuracy Over Time')
# 6. Trade distribution
if prod_system.trades_log:
positions = [t['new_position'] for t in prod_system.trades_log]
unique, counts = np.unique(positions, return_counts=True)
labels = ['Short' if p == -1 else 'Long' for p in unique]
axes[1, 2].bar(labels, counts)
axes[1, 2].set_ylabel('Count')
axes[1, 2].set_title('Trade Distribution')
plt.tight_layout()
plt.show()
print("\n" + "="*60)
print("CAPSTONE PROJECT COMPLETED")
print("="*60)
Project Summary
In this capstone project, you built a complete end-to-end ML trading system that includes:
- Data Generation: Created realistic multi-asset market data with regime switching and volatility clustering
- Feature Engineering: Built a comprehensive feature pipeline with price, volatility, technical, and volume features
- Model Training: Trained and compared multiple ML models (Logistic Regression, Random Forest, Gradient Boosting)
- Walk-Forward Backtesting: Implemented proper walk-forward validation with realistic transaction costs
- Production System: Designed a production-ready system with prediction, trading, and monitoring capabilities
- Performance Analysis: Generated comprehensive analysis and reporting
Key Takeaways
- Proper feature engineering is crucial for ML trading systems
- Walk-forward validation provides realistic performance estimates
- Transaction costs significantly impact strategy returns
- Production systems require monitoring and alert mechanisms
- Continuous improvement and adaptation are essential
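To make the first three takeaways concrete, here is a minimal, self-contained sketch of expanding-window walk-forward splits plus a proportional cost haircut on position changes. The function names, fold counts, and `cost_per_trade` value are illustrative choices for this sketch, not the exact parameters used in the capstone classes above.

```python
import numpy as np

def walk_forward_splits(n_samples: int, n_folds: int, test_size: int):
    """Yield (train_idx, test_idx) pairs with an expanding train window.

    Each fold trains only on data strictly before its test block, so no
    test observation ever leaks into its own training set.
    """
    for k in range(n_folds):
        test_start = n_samples - (n_folds - k) * test_size
        if test_start <= 0:
            continue  # not enough history to train this fold
        train_idx = np.arange(0, test_start)
        test_idx = np.arange(test_start, test_start + test_size)
        yield train_idx, test_idx

def net_returns(gross_returns, positions, cost_per_trade=0.001):
    """Per-bar strategy returns after a simple proportional cost.

    A cost is charged on every position change, mirroring how transaction
    costs erode the gross edge of a signal.
    """
    positions = np.asarray(positions, dtype=float)
    trades = np.abs(np.diff(positions, prepend=0.0))  # size of each position change
    return positions * np.asarray(gross_returns) - trades * cost_per_trade

# Example: 100 bars split into 3 sequential test blocks of 10 bars each
splits = list(walk_forward_splits(100, n_folds=3, test_size=10))
for train_idx, test_idx in splits:
    print(f"train [0, {train_idx[-1]}] -> test [{test_idx[0]}, {test_idx[-1]}]")
```

Note how the cost term turns over-trading into a direct drag on returns: a strategy that flips position every bar pays `cost_per_trade` on each flip regardless of whether the signal was right.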
Congratulations!
You have successfully completed the Machine Learning for Financial Markets course. You now have the skills to build, backtest, and deploy ML trading systems.